Hinton's Course, Programming Assignment 2 (PA2)
These are my notes on Hinton's course Neural Networks for Machine Learning. Since I rarely use MATLAB, there are quite a few MATLAB-related explanations along the way.
Background
The assignment uses a vocabulary of 250 words. Each record in the dataset contains 4 words, and the goal is to predict the 4th word given the first three as input.
The corresponding lecture is Lecture 4, which starts from a probabilistic representation of words and introduces backpropagation. The parts that are harder to grasp are how words are encoded and how they are represented in the embedding layer [1][2].
What the embedding layer means
The in-lecture quiz already hints at why each word should occupy its own element (one-hot encoding; a small sketch follows this list):
- The classes become linearly separable.
- The entries are independent of each other, so no prior knowledge is baked in.
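A minimal sketch of the one-hot idea (the variable names are illustrative, not from the assignment code): each of the 250 words gets its own slot, and only that slot is set to 1.
vocab_size = 250;              % size of the vocabulary used in the assignment
word_index = 42;               % hypothetical index of some word
one_hot = zeros(1, vocab_size);
one_hot(word_index) = 1;       % only the word's own position is 1
% Equivalent: take one row of the identity matrix -- the same trick the
% assignment code uses later via expansion_matrix = eye(vocab_size).
E = eye(vocab_size);
isequal(one_hot, E(word_index, :))   % -> 1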
Load Data and Structs
MATLAB structs
data.mat and load_data.m handle the data. Running load data.mat gives you a struct. A MATLAB struct is similar to a C struct and is one of the many supported data types. Plenty of functions work on structs; a particularly useful one is fieldnames, which returns the field names (see the docs for the rest). Typing data at the Octave command line prints a summary of the variable:
>> data
data =
scalar structure containing the fields:
testData =
Columns 1 through 25:
xx xx xx
Calling the fieldnames function returns the following fields:
>> fieldnames(data)
ans =
{
[1, 1] = testData
[2, 1] = trainData
[3, 1] = validData
[4, 1] = vocab
}
It takes some getting used to that everything in MATLAB is indexed like a matrix. Also, why does it say scalar struct? The tutorial on the MATLAB website shows how to create a struct:
patient(1).name = 'John Doe';
patient(1).billing = 127.00;
patient(1).test = [79, 75, 73; 180, 178, 177.5; 220, 210, 205];
patient
patient = scalar struct containing the fields:
name: 'John Doe'
billing: 127
test: [3×3 double]
Add a second struct to the array:
patient(2).name = 'Ann Lane';
patient(2).billing = 28.50;
patient(2).test = [68, 70, 68; 118, 118, 119; 172, 170, 169];
patient
patient = 1×2 struct array with fields:
name
billing
test
Now it has become a 1x2 struct array rather than a scalar. An interesting section of the docs is Cell vs. Struct Arrays: the difference is that a struct can be indexed by field name, while a cell array can only be indexed by position. See the example below:
temperature(1,:) = {'2009-12-31', [45, 49, 0]};
temperature(2,:) = {'2010-04-03', [54, 68, 21]};
temperature(3,:) = {'2010-06-20', [72, 85, 53]};
temperature(4,:) = {'2010-09-15', [63, 81, 56]};
temperature(5,:) = {'2010-12-09', [38, 54, 18]};
temperature
temperature = 5×2 cell array
'2009-12-31' [1×3 double]
'2010-04-03' [1×3 double]
'2010-06-20' [1×3 double]
'2010-09-15' [1×3 double]
'2010-12-09' [1×3 double]
As you can see, arrays are still created and addressed as (row, col), and indexing starts at 1. A cell array feels a bit like a Python tuple; you retrieve data by row and column. For example:
>> temperature(:, 1)
ans =
{
[1, 1] = 2009-12-31
[2, 1] = 2010-04-03
[3, 1] = 2010-06-20
[4, 1] = 2010-09-15
[5, 1] = 2010-12-09
}
>> temperature(1, :)
ans =
{
[1,1] = 2009-12-31
[1,2] = 45 49 0
}
By the way, at the command line clear removes variables you have defined, and whos or class shows a variable's size and type.
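For example, a quick session at the Octave prompt (the whos listing is left out here):
>> x = 1:5;          % define a variable
>> class(x)          % its type
ans = double
>> whos x            % name, size, bytes and class of x
>> clear x           % x is now gone from the workspace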
Code
function [train_input, train_target, valid_input, valid_target, test_input, test_target, vocab] = load_data(N)
% This method loads the training, validation and test set.
% It also divides the training set into mini-batches.
% Inputs:
% N: Mini-batch size.
% Outputs:
% train_input: An array of size D X N X M, where
% D: number of input dimensions (in this case, 3).
% N: size of each mini-batch (in this case, 100).
% M: number of minibatches.
% train_target: An array of size 1 X N X M.
% valid_input: An array of size D X number of points in the validation set.
% test: An array of size D X number of points in the test set.
% vocab: Vocabulary containing index to word mapping.
load data.mat;
numdims = size(data.trainData, 1);
D = numdims - 1;
M = floor(size(data.trainData, 2) / N);
train_input = reshape(data.trainData(1:D, 1:N * M), D, N, M);
train_target = reshape(data.trainData(D + 1, 1:N * M), 1, N, M);
valid_input = data.validData(1:D, :);
valid_target = data.validData(D + 1, :);
test_input = data.testData(1:D, :);
test_target = data.testData(D + 1, :);
vocab = data.vocab;
end
How the size function works:
sz = size(A) returns a row vector whose elements contain the length of the corresponding dimension of A. For example, if A is a 3-by-4 matrix, then size(A) returns the vector [3 4]. The length of sz is ndims(A).
If A is a table or timetable, then size(A) returns a two-element row vector consisting of the number of rows and the number of table variables.
szdim = size(A,dim) returns the length of dimension dim.
[m,n] = size(A) returns the number of rows and columns when A is a matrix.
[sz1,...,szN] = size(A) returns the length of each dimension of A separately.
So size returns a vector of dimension lengths, and with a second argument it returns the length of a single dimension.
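A quick check of these forms at the prompt (output spacing may differ slightly):
>> A = zeros(3, 4);
>> size(A)            % all dimension lengths
ans =
   3   4
>> size(A, 2)         % length of dimension 2 only
ans = 4
>> [m, n] = size(A)   % rows and columns separately
m = 3
n = 4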
(1:D, 1:N * M)
The first part selects rows 1 through D; the second selects the first N*M columns, i.e. enough columns for M mini-batches of N cases each.
data.trainData is a 4 x 372550 matrix. It is split into two parts: the first three rows go to train_input and the last row to train_target, and reshape then carves both into mini-batches (train_input becomes D x N x M, train_target becomes 1 x N x M). Here N = 100, D = 3, M = 3725. A toy version of this slicing appears after the transcript below.
>> data.trainData(1:4, 1:10)
ans =
28 184 183 117 223 42 242 223 74 42
26 44 32 247 190 74 32 32 32 192
90 249 76 201 249 26 223 158 221 91
144 117 122 186 6 32 32 144 32 68
>> train_input(1:3, 1:10)
ans =
28 184 183 117 223 42 242 223 74 42
26 44 32 247 190 74 32 32 32 192
90 249 76 201 249 26 223 158 221 91
>> train_target(1, 1:10)
ans =
144 117 122 186 6 32 32 144 32 68
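To see concretely what the (1:D, 1:N*M) slicing plus reshape does, here is a toy version with made-up numbers, using D = 3, N = 3, M = 2 instead of the real sizes:
toyData = [ 1  2  3  4  5  6;      % rows 1-3: the three input words of each 4-gram
            7  8  9 10 11 12;
           13 14 15 16 17 18;
           19 20 21 22 23 24];     % row 4: the target word of each 4-gram
D = 3; N = 3; M = 2;
toy_input  = reshape(toyData(1:D, 1:N*M), D, N, M);   % 3 x 3 x 2 array
toy_target = reshape(toyData(D+1, 1:N*M), 1, N, M);   % 1 x 3 x 2 array
toy_input(:, :, 2)     % second mini-batch = columns 4-6 of the first three rows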
validData and testData are both 4 x 46568, an order of magnitude smaller than the training set. Note that only the training data is split into mini-batches; the validation and test sets are not. Each input case is a column vector.
Training
Initialization and hyperparameters
% This function trains a neural network language model.
function [model] = train(epochs)
% Inputs:
% epochs: Number of epochs to run.
% Output:
% model: A struct containing the learned weights and biases and vocabulary.
% SET HYPERPARAMETERS HERE.
batchsize = 100; % Mini-batch size.
learning_rate = 0.1; % Learning rate; default = 0.1.
momentum = 0.9; % Momentum; default = 0.9.
numhid1 = 50; % Dimensionality of embedding space; default = 50.
numhid2 = 200; % Number of units in hidden layer; default = 200.
init_wt = 0.01; % Standard deviation of the normal distribution
% which is sampled to get the initial weights; default = 0.01
epochs is the number of passes over the training set; momentum means gradient descent with momentum is used (a toy sketch of the momentum update follows).
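As a reminder of what momentum does, here is a self-contained toy, not the assignment's train.m: it just minimises f(w) = 0.5 * w' * w, and the update keeps a decaying running sum of past gradients.
learning_rate = 0.1;            % same default as above
momentum = 0.9;                 % same default as above
w = [5; -3];                    % toy parameter vector
delta = zeros(size(w));         % running momentum term
for step = 1:100
  grad  = w;                            % gradient of 0.5 * w' * w is w itself
  delta = momentum * delta + grad;      % accumulate a decaying gradient history
  w     = w - learning_rate * delta;    % parameter update
end
w                               % converges towards [0; 0]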
% VARIABLES FOR TRACKING TRAINING PROGRESS.
show_training_CE_after = 100;
show_validation_CE_after = 1000;
CE stands for cross entropy. The average training cross entropy is reported every 100 mini-batches, and every 1000 mini-batches the model is run on the validation set to compute a validation cross entropy.
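Concretely, the quantity being averaged is -(1/N) times the sum of log(predicted probability of the true next word) over the mini-batch. A small sketch with made-up numbers (vocab_size = 4, a batch of 3 cases, predictions stored one column per case as in fprop's output):
probs = [0.7 0.1 0.2;           % each column: a predicted distribution for one case
         0.1 0.6 0.3;
         0.1 0.2 0.4;
         0.1 0.1 0.1];
targets = [1 2 3];              % true word index for each case
tiny = exp(-30);                % avoids log(0), same trick as the code below
N = numel(targets);
p_true = probs(sub2ind(size(probs), targets, 1:N));  % probability given to the true word
CE = -sum(log(p_true + tiny)) / N                    % about 0.59 here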
% LOAD DATA.
[train_input, train_target, valid_input, valid_target, ...
test_input, test_target, vocab] = load_data(batchsize);
[numwords, batchsize, numbatches] = size(train_input);
vocab_size = size(vocab, 2);
From the first section we already know numwords = 3, batchsize = 100, numbatches = 3725. In addition, vocab is laid out as a row with 250 entries, so vocab_size = 250.
% INITIALIZE WEIGHTS AND BIASES.
word_embedding_weights = init_wt * randn(vocab_size, numhid1);
embed_to_hid_weights = init_wt * randn(numwords * numhid1, numhid2);
hid_to_output_weights = init_wt * randn(numhid2, vocab_size);
hid_bias = zeros(numhid2, 1);
output_bias = zeros(vocab_size, 1);
word_embedding_weights_delta = zeros(vocab_size, numhid1);
word_embedding_weights_gradient = zeros(vocab_size, numhid1);
embed_to_hid_weights_delta = zeros(numwords * numhid1, numhid2);
hid_to_output_weights_delta = zeros(numhid2, vocab_size);
hid_bias_delta = zeros(numhid2, 1);
output_bias_delta = zeros(vocab_size, 1);
expansion_matrix = eye(vocab_size);
count = 0;
tiny = exp(-30);
randn creates normally distributed random numbers; its arguments are the numbers of rows and columns (related forms include randn() and randn(n)). zeros takes the same kind of arguments as randn, and eye is similar, producing an identity matrix. The network diagram:
(Network diagram image D:\dl\network.png is not available here; the weight shapes it shows are listed below.)
- word_embedding_weights = [250, 50]
- embed_to_hid_weights = [3 * 50, 200]
- hid_to_output_weight = [200, 250]
- hid_bias = [200, 1]
- output_bias = [250, 1]
In the diagram the inputs are the indices of three words and the output is the predicted index of the fourth word. The embedding layer has 50 units (per word) by default and the hidden layer has 200. So going from the input to the embedding layer just means looking up the corresponding index in word_embedding_weights.
Q: What does "50 units" mean here? Aren't there only three inputs?
A: The diagram is drawn very compactly and does not show individual units. In fact the embedding layer has 50 units per input word and is fully connected to the input; the hidden layer relates to the embedding layer in the same way. So every input shows up in every weight matrix.
Q: How is each word represented?
A: Each word is just an index, and each word is connected to 50 units, i.e. its own embedding row (see the lookup sketch below).
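A tiny sketch of that lookup (the index is illustrative, and the weights are freshly initialised in the same init_wt * randn form as above): a word's learned representation is simply its row of word_embedding_weights.
vocab_size = 250; numhid1 = 50;
word_embedding_weights = 0.01 * randn(vocab_size, numhid1);
word_index = 42;                                      % hypothetical word
word_vector = word_embedding_weights(word_index, :);  % that word's 1 x 50 embedding
size(word_vector)                                     % -> 1   50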
Train
Forward propagation
% TRAIN.
for epoch = 1:epochs
fprintf(1, 'Epoch %d\n', epoch);
this_chunk_CE = 0;
trainset_CE = 0;
% LOOP OVER MINI-BATCHES.
for m = 1:numbatches
input_batch = train_input(:, :, m);
target_batch = train_target(:, :, m);
% FORWARD PROPAGATE.
% Compute the state of each layer in the network given the input batch
% and all weights and biases
[embedding_layer_state, hidden_layer_state, output_layer_state] = ...
fprop(input_batch, ...
word_embedding_weights, embed_to_hid_weights, ...
hid_to_output_weights, hid_bias, output_bias);
input_batch is [3, 100] and target_batch is [1, 100]. In the fprop code, the state of the word embedding layer is computed first:
[numwords, batchsize] = size(input_batch);
[vocab_size, numhid1] = size(word_embedding_weights);
numhid2 = size(embed_to_hid_weights, 2);
%% COMPUTE STATE OF WORD EMBEDDING LAYER.
% Look up the inputs word indices in the word_embedding_weights matrix.
embedding_layer_state = reshape(...
word_embedding_weights(reshape(input_batch, 1, []),:)',...
numhid1 * numwords, []);
Conceptually this is still a matrix multiplication, but indexing cuts the computation down [1]. It is equivalent to:
inputs = expansion_matrix(:, reshape(input_batch, 1, []))';  % expansion_matrix = eye(vocab_size); one one-hot row per input word, size (numwords * batchsize) x vocab_size
inputs * word_embedding_weights                               % each one-hot row picks out the matching row of the weight matrix
The lookup itself produces a [300, 50] matrix; after the transpose and reshape, embedding_layer_state is [150, 100] (numhid1 * numwords rows, one column per training case).
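A quick check of that equivalence with toy sizes: multiplying one-hot rows by the weight matrix gives exactly the same result as indexing the rows directly.
vocab_size = 5; numhid1 = 2;            % toy sizes
W = randn(vocab_size, numhid1);
idx = [3 1 4];                          % three word indices
E = eye(vocab_size);
onehot = E(idx, :);                     % one one-hot row per word
isequal(onehot * W, W(idx, :))          % -> 1: identical, but indexing skips the big multiply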
Q: What does the [] in reshape(input_batch, 1, []) mean?
A: Without it the call errors out; as the API docs explain, [] tells MATLAB to work that dimension out by itself. So this line flattens the 3 x 100 input_batch into a 1 x 300 row vector.
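A quick check at the prompt; the order of the elements also already shows the column-major layout mentioned below:
>> M = [1 2 3; 4 5 6];
>> reshape(M, 1, [])     % [] lets MATLAB infer the 6
ans =
   1   4   2   5   3   6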
Q: How are the values in input_batch used to look up the weight matrix?
A: word_embedding_weights is a [250, 50] matrix. Indexing it with the 1 x 300 row vector is plain MATLAB vector indexing (the "Accessing Multiple Elements" behaviour linked below): each value is used as a row index, and the result expands to a [300, 50] matrix. Incidentally, MATLAB stores matrices column by column in memory (column-major), so column-wise access is the most efficient.
Q: That still doesn't spell the process out. Can you give an example?
A: Here is one; the values in input simply choose which rows are taken.
>> weights = magic(3)
weights =
8 1 6
3 5 7
4 9 2
>> input = [3, 2, 1, 3]' % values must not exceed 3, otherwise indexing errors out
>> weights(input, :)
ans =
4 9 2
3 5 7
8 1 6
4 9 2
Q: Why does the example use a column vector as input rather than a row vector?
A: I tried both and the result is the same; perhaps a column vector is marginally more efficient (memory layout)? Retested as follows:
>> input = [3, 2]'
>> weights(input, :)
ans =
4 9 2
3 5 7
>> input = [3, 2];
>> weights(input, :)
ans =
4 9 2
3 5 7
Exactly how the shapes change is covered in Accessing Multiple Elements in the MATLAB docs.
[1] 詞向量與Embedding究竟是怎么回事? http://spaces.ac.cn/archives/4122/
[2] YJango的Word Embedding--介紹 https://zhuanlan.zhihu.com/p/27830489