Downloading the IMDB dataset
IMDB is a movie-review dataset that ships with the keras library. It contains 50,000 highly polarized reviews, split evenly into 25,000 for training and 25,000 for testing. The data has already been preprocessed: each review's sequence of words has been converted into a sequence of integers, where every integer stands for a particular word in a dictionary. Let's explore the dataset through code:
- Import the keras dataset package
from keras.datasets import imdb
- Load the data with imdb.load_data. The argument num_words=10000 keeps only the 10,000 most frequently occurring words in the data; rarer words are discarded. The method returns (train_data, train_labels), (test_data, test_labels), where train_data holds the training reviews, train_labels holds the corresponding classes (0 for negative, 1 for positive), test_data holds the test reviews, and test_labels holds their classes.
(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000)
We print train_data and train_labels with the print function. The results look like this:
[list([1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 2, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 2, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22, 4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 2, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2223, 5244, 16, 480, 66, 3785, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 2, 8, 4, 107, 117, 5952, 15, 256, 4, 2, 7, 3766, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 2, 1029, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2071, 56, 26, 141, 6, 194, 7486, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 5535, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 1334, 88, 12, 16, 283, 5, 16, 4472, 113, 103, 32, 15, 16, 5345, 19, 178, 32])...
[1 0 0 ... 0 1 0]
Because the data has already been preprocessed, what gets printed is a set of integer sequences.
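As a quick sanity check that num_words=10000 took effect, we can confirm that no word index in the training data reaches 10000 (a small aside, not required for what follows):

print(max(max(sequence) for sequence in train_data))  # 9999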
Next, let's map the integers back to words and see what word each integer stands for.
word_index = imdb.get_word_index()  # dictionary mapping word -> integer index
reversed_word_index = dict(
    [value, key] for (key, value) in word_index.items())  # integer index -> word
decoded_review = ' '.join(
    [reversed_word_index.get(i - 3, '?') for i in train_data[0]])
The imdb.get_word_index method returns a dictionary that maps words to integer indices. Printing word_index gives:
{...., 'ev': 88575, 'chicatillo': 88576, 'transacting': 88577, "'la": 27630, 'percent': 8925, 'oprah': 7996, 'sics': 88578, 'illinois': 11925, 'dogtown': 40828, 'roars': 20595, 'branch': 9456, 'kerouac': 52002, 'wheelers': 88579, 'sica': 20596, 'lance': 6435, "pipe's": 88580, 'discretionary': 64179, 'contends': 40829, 'copywrite': 88581, 'geysers': 52003, 'artbox': 88582, 'cronyn': 52004, 'hardboiled': 52005, "voorhees'": 88583, '35mm': 16815, "'l'": 88584, 'paget': 18509, 'expands': 20597,...}
reversed_word_index is a dictionary that inverts word_index, turning its {word: integer} entries into {integer: word} entries. We need this inversion because the data we read in is already preprocessed into integer form, so we have to look words up by their integer index.
In reversed_word_index.get(i - 3, '?'), i - 3 is the key to look up and '?' is the default returned when the key is missing. The offset of 3 is needed because the dataset reserves indices 0, 1, and 2 for "padding", "start of sequence", and "unknown". Here we take train_data[0], the first review in the training data. ' '.join(...) concatenates the decoded words into a single string with a space between each pair. Printing decoded_review gives:
? this film was just brilliant casting location scenery story direction everyone's really suited the part they played and you could just imagine being there robert ? is an amazing actor and now the same being director ? father came from the same scottish island as myself so i loved the fact there was a real connection with this film the witty remarks throughout the film were great it was just brilliant so much that i bought the film as soon as it was released for ? and would recommend it to everyone to watch and the fly fishing was amazing really cried at the end it was so sad and you know what they say if you cry at a film it must have been good and this definitely was also ? to the two little boy's that played the ? of norman and paul they were just brilliant children are often left out of the ? list i think because the stars that play them all grown up are such a big profile for the whole film but these children are amazing and should be praised for what they have done don't you think the whole story was so lovely because it was true and was someone's life after all that was shared with us all
With that, we can see the effect of mapping the integers back into a review. Note that this reverse mapping serves no purpose for the binary classification itself; the exercise is only meant to show that integers and words correspond one to one. A neural network can operate only on numbers, not on text, which is why the words have to be encoded as numbers in the first place. There are many ways to encode text as numbers, and we will meet more of them in later chapters.
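For reference, the two building blocks used in the decoding code, dict.get with a default and ' '.join, behave like this (a standalone micro-example):

d = {10: 'film'}
print(d.get(10, '?'))             # film  - key present, its value is returned
print(d.get(99, '?'))             # ?     - key missing, the default is returned
print(' '.join(['a', 'b', 'c']))  # a b c - list elements joined by spaces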
Encoding the integer sequences as a binary matrix
In the previous section we downloaded and read in the data; now we have to process it. The goal of this processing: encode the integer sequences as a binary matrix.
- Implement the encoding function
import numpy as np

def vectorize_sequences(sequences, dimension=10000):
    # all-zero matrix: one row per sequence, one column per possible word index
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1.  # set row i to 1 at every index listed in sequence
    return results
The signature is numpy.zeros(shape, dtype=float, order='C'), where shape is the shape of the matrix (how many rows and columns it has), dtype defaults to float, and order chooses the memory layout: 'C' stores the data row by row, 'F' column by column. So np.zeros((len(sequences), dimension)) creates a zero matrix with len(sequences) rows and 10000 columns.
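For instance, a 2 x 3 zero matrix (np is the numpy import from the listing above):

print(np.zeros((2, 3)))
# [[0. 0. 0.]
#  [0. 0. 0.]]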
The enumerate() function takes an iterable (such as a list, tuple, or string) and turns it into an indexed sequence, yielding each element together with its position. For example:
seasons = ['Spring', 'Summer', 'Fall', 'Winter']
print(list(enumerate(seasons)))
The output is:
[(0, 'Spring'), (1, 'Summer'), (2, 'Fall'), (3, 'Winter')]
Now let's apply the same idea to the two lines from our function:
for i, sequence in enumerate(sequences):
results[i, sequence] = 1.
Here is what these two lines mean:
Suppose sequences = [1,3,2,3], and results is a 4 x 4 zero matrix. Then enumerate produces [(0,1), (1,3), (2,2), (3,3)], so:
results[0,1] = 1.
results[1,3] = 1.
results[2,2] = 1.
results[3,3] = 1.
and the matrix results can therefore be written as
0 1 0 0
0 0 0 1
0 0 1 0
0 0 0 1
In other words, every row of the matrix uniquely represents one element of sequences in binary form, and exactly one bit per row is set. This scheme is called one-hot encoding.
Because we kept only the 10,000 most common words when loading the data in the previous section, every word index is below 10,000, so 10,000 columns are enough; that is why our vectorize_sequences function uses dimension=10000.
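One caveat about the toy example above: in the real data, each element of sequences is itself a list of word indices, so the indexing sets a 1 in every column whose index appears in that review, and a row can contain several 1s (duplicated indices still yield a single 1). A small sketch using the function we defined, with dimension=8 for readability:

print(vectorize_sequences([[1, 3], [0, 2, 5]], dimension=8))
# [[0. 1. 0. 1. 0. 0. 0. 0.]
#  [1. 0. 1. 0. 0. 1. 0. 0.]]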
- Vectorize train_data and test_data
x_train = vectorize_sequences(train_data)
x_test = vectorize_sequences(test_data)
- Vectorize train_labels and test_labels
y_train = np.asarray(train_labels).astype('float32')
y_test = np.asarray(test_labels).astype('float32')
The asarray method converts structured data, such as a nested list, into an ndarray. For example:
data1=[[1,1,1],[1,1,1],[1,1,1]]
arr3=np.asarray(data1)
The resulting arr3 is:
arr3:
[[1 1 1]
[1 1 1]
[1 1 1]]
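The .astype('float32') call in our code then converts the integer labels into 32-bit floats, the form the network expects:

labels = [0, 1, 1, 0]
print(np.asarray(labels).astype('float32'))
# [0. 1. 1. 0.]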
Building the keras binary classification model
With the processing above, the data is now in vector form. Next comes the core of this chapter: building a binary classification model with keras:
from keras import models
from keras import layers
model = models.Sequential()  # build the model in the Sequential style
model.add(layers.Dense(16, activation='relu', input_shape=(10000, )))
model.add(layers.Dense(16, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))
keras offers two ways to build a model: the built-in Sequential model and the functional API, each with its own strengths. For getting started and for ease of use, the Sequential model is the simpler of the two; the functional API is more flexible and lets you realize whatever network architecture your problem calls for, so you can tune the network to your needs. In general, unless you are researching a network architecture of your own, the Sequential style covers everyday needs, and it is the one we use here.
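For comparison, here is a sketch of the same three-layer network written with the functional API (the variable names are mine, not part of the original code):

from keras import layers, models

inputs = layers.Input(shape=(10000,))
x = layers.Dense(16, activation='relu')(inputs)
x = layers.Dense(16, activation='relu')(x)
outputs = layers.Dense(1, activation='sigmoid')(x)
functional_model = models.Model(inputs, outputs)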
The model.add method adds one layer to the model, and layers.Dense creates a fully connected (dense) layer. "Fully connected" means that every output of the previous layer is fed as an input to every neuron in this layer. Take the first layer:
layers.Dense(16, activation='relu', input_shape=(10000, ))
Here, 16 is the number of units in the layer, relu is the activation function, and input_shape is the shape of the input. The number of units is also the number of outputs this layer produces. keras provides many built-in activation functions besides relu; see the keras documentation for details. In short, this line adds a fully connected layer with 16 units, 10,000 inputs, and relu activation.
The next two layers are added the same way, except that no layer after the first needs input_shape: the number of inputs to each later layer is determined by the outputs of the layer before it.
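You can verify this chaining of shapes, along with each layer's parameter count, by calling model.summary(). The output looks roughly like the following (abridged; the exact formatting depends on your keras version):

model.summary()
# dense (Dense)      (None, 16)   160016   <- 10000 inputs x 16 units + 16 biases
# dense_1 (Dense)    (None, 16)   272      <- 16 x 16 + 16
# dense_2 (Dense)    (None, 1)    17       <- 16 x 1 + 1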
編譯網(wǎng)絡(luò)模型
model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['accuracy'])
編譯網(wǎng)絡(luò)模型方法中,指定了模型的優(yōu)化函數(shù)夹攒、損失函數(shù)以及評估函數(shù)蜘醋。簡單的講,損失函數(shù)用來評估網(wǎng)絡(luò)計(jì)算值和實(shí)際值之間的誤差咏尝,優(yōu)化函數(shù)通過方向傳播誤差的方法優(yōu)化網(wǎng)絡(luò)的權(quán)重压语,評估函數(shù)同損失函數(shù)類似,只不過它是用來評估網(wǎng)絡(luò)性能的方法编检。一般而言无蜂,二分類問題我們選用binary_crossentropy作為損失函數(shù)。
訓(xùn)練網(wǎng)絡(luò)
model.fit(x_train, y_train, epochs=20, batch_size=512)
In the fit call, epochs is the number of training passes over the data, and batch_size is the batch size used by mini-batch stochastic gradient descent; in other words, an average gradient is computed, and the weights updated, once for every 512 samples.
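If you also want to watch how the model performs on data it is not trained on, fit accepts a validation set; a sketch (the 10,000-sample split is an illustrative choice, not part of the code above):

# hold out the first 10,000 training samples for validation
x_val, partial_x_train = x_train[:10000], x_train[10000:]
y_val, partial_y_train = y_train[:10000], y_train[10000:]

history = model.fit(partial_x_train, partial_y_train,
                    epochs=20, batch_size=512,
                    validation_data=(x_val, y_val))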
That completes the walkthrough of the basic code for the binary classification problem. To wrap up, let's run model.predict(x_test) and look at the output for the test data:
[[0.01034644]
[0.9999963 ]
[0.9942436 ]
...
[0.03826523]
[0.00732365]
[0.97382474]]
Notice that what the network outputs are actually probabilities. The first value, 0.01034644, tells us this review is most likely negative, whereas the second, 0.9999963, is most likely a positive review.
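To turn these probabilities into hard 0/1 class labels, threshold them at 0.5 (a small closing sketch):

predictions = model.predict(x_test)
predicted_classes = (predictions > 0.5).astype('int32')  # 1 = positive, 0 = negative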