So far we have studied the gradient descent algorithm, artificial neural networks, and the backpropagation algorithm. Each carries its own weight:
- Gradient descent: the algorithmic framework that lets a machine learn on its own;
- Artificial neural network: a concrete form of the "universal function";
- Backpropagation: an efficient way to compute the gradients that gradient descent needs for a neural network.
With these in hand, we have the core knowledge needed to build genuinely useful intelligent programs. That knowledge was hard won: from the artificial neuron of the 1940s to the revival of backpropagation in the late 1980s, it took nearly half a century. And yet implementing all of it and tackling the non-trivial task of handwritten digit recognition takes only 74 lines of Python (ignoring blank lines and comments). Bear in mind that attacking this task with conventional programming (a non-learning approach) would be remarkably hard.
This post walks through that Python file, "network.py". Built on NumPy, after learning from 50,000 images it can recognize the handwritten digits 0-9 with an accuracy above 95%. I strongly suggest setting TensorFlow aside for the moment and taking in an algorithm that distills decades of hard-earned insight. The code comes from Michael Nielsen's "Neural Networks and Deep Learning", lightly modified (formatting and environment compatibility); a download link is given at the end of the post.
The MNIST dataset
Back in 1998, Yann LeCun at AT&T Bell Labs was already using artificial neural networks to attack handwritten digit recognition, driven by the need to automatically read bank checks and postal codes on mail. The MNIST dataset grew out of that work. It covers the ten handwritten digits 0-9, with 60,000 training images and 10,000 test images, and can be downloaded from Yann LeCun's website.
MNIST originated from the database of NIST (the National Institute of Standards and Technology); after preprocessing it became the version better suited to machine learning algorithms, the leading M standing for "Modified". It remains the most widely used benchmark dataset for machine learning experiments, much as biologists keep running experiments on fruit flies; Geoffrey Hinton has described it as "the drosophila of machine learning". Handwritten digit recognition has likewise become the canonical introductory exercise in machine learning.
As the figure above shows, MNIST images are grayscale: a pixel value of 0 means white, 1 means black, and values in between are shades of gray. Each sample image is 28x28, i.e. 784 pixels.
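To make the representation concrete, here is a minimal sketch (not part of network.py) of how one such image becomes the 784-dimensional column vector the network will consume; the array img is a made-up stand-in for a single MNIST sample.

import numpy as np
img = np.random.rand(28, 28)   # a made-up 28x28 grayscale image, values in [0, 1]
x = img.reshape(784, 1)        # flatten into the 784x1 column vector the input layer expects
print(x.shape)                 # (784, 1)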
Training set and test set
The 60,000 training images in MNIST were scanned from the handwriting of 250 people, half of them employees of the US Census Bureau and half high-school students. The 10,000 test images come from a different group of 250 people (though again drawn from the Census Bureau and schools). Why go to that trouble? The answer is generalization.
What we want is for a model trained on the training set to recognize samples it has never seen. That ability is generalization; informally, the ability to extrapolate from examples. The human brain generalizes remarkably well: a two-year-old who has seen a handful of duck pictures can recognize ducks of all shapes they have never encountered before.
With this in mind, the test set never takes part in training; it is deliberately held out to measure how well the model generalizes. Zhou Zhihua's "watermelon book" offers an analogy: if the exercises students use to revise are the very questions that appear on the exam, then even a perfect score does not prove they actually learned anything.
Labels
The right-hand part of the figure above shows the labels: one per sample image, assigned by human annotators. Labels are an indispensable part of the dataset. Training a model is precisely the process of nudging its predictions ever closer to the labels. Learning that relies on labels in this way is called supervised learning.
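For the training data it is convenient to turn each label digit into a 10x1 one-hot column vector, so that it can be compared directly against the 10 output activations. A minimal sketch of that idea (the helper name one_hot is illustrative, not a function from the data loader):

import numpy as np
def one_hot(digit):
    # Turn a digit 0-9 into a 10x1 one-hot column vector.
    e = np.zeros((10, 1))
    e[digit] = 1.0
    return e
print(one_hot(3).T)            # a row with 1.0 in position 3 and zeros elsewhere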
Validation set and hyperparameters
Michael Nielsen's code splits the 60,000 training images further: 50,000 serve as the training set and 10,000 as the validation set. The MNIST files used by the code therefore differ slightly from Yann LeCun's originals; the version used in this post can be downloaded here.
A model's parameters are adjusted automatically from the training data. The remaining knobs that the learning algorithm does not touch, such as the network's learning rate or the mini-batch size in stochastic gradient descent, are called hyperparameters. The validation set is carved out precisely to evaluate the model's generalization and, on that basis, to tune the hyperparameters.
An obvious question arises here: isn't evaluating generalization the job of the test set?
The test set does measure generalization, but ideally only as a final assessment. Nothing the test set tells us should feed back into improving the model, otherwise the model gradually overfits the test set. The validation set, split off from the training data, carries no such restriction: on one hand it stays out of training, so it can gauge generalization; on the other, its feedback can legitimately be used to refine the network architecture and the hyperparameters.
Validation data is not part of the MNIST specification, but holding some out has become standard practice.
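As an illustration of how the validation set usually drives hyperparameter choices, here is a hedged sketch (not in network.py) that trains one network per candidate learning rate and keeps whichever scores best on the validation data. It assumes the Network class and the data loader introduced later in this post, and that validation_data pairs come in the same (image, digit) format that the evaluate method expects.

best_eta, best_acc = None, 0
for eta in [0.5, 1.0, 3.0]:                 # candidate learning rates
    net = network.Network([784, 30, 10])
    net.SGD(training_data, 10, 10, eta)     # the test set is never touched here
    acc = net.evaluate(validation_data)     # correctly classified validation samples
    if acc > best_acc:
        best_eta, best_acc = eta, acc
print(best_eta, best_acc)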
Python essentials: building the tensors
This post uses the same Python setup as the sample code in "Neural Networks and Deep Learning":
- Python 2.7.x, inside a dedicated conda virtual environment (see 1 Hello, TensorFlow! for the setup);
- NumPy 1.13.1.
Python, the leading language of the AI era, has an excellent ecosystem; with NumPy, matrix and set operations usually come down to a single decisive line. To make the upcoming code easy to follow, I have pulled out one passage worth a close look; with a small tweak (the sizes of the first two layers) it runs on its own.
1 import numpy as np
2 sizes = [8, 15, 10]
3 biases = [np.random.randn(y, 1) for y in sizes[1:]]
4 weights = [np.random.randn(y, x) for x, y in zip(sizes[:-1], sizes[1:])]
Line 1: imports numpy and makes np its alias.
Line 2: defines a list, sizes, containing 3 elements.
Line 3:
- Look first at sizes[1:]: it is a slice of sizes running from index 1 through the last element, so its value works out to [15, 10];
- next comes NumPy's random number generator random.randn, which draws samples from the standard normal distribution (mean 0, standard deviation 1);
- the arguments of random.randn describe the shape of the tensor to generate; random.randn(2, 3), for example, produces a rank-2 tensor of shape [2, 3], i.e. a matrix such as array([[-2.17399771, 0.20546498, -1.2405749], [-0.36701965, 0.12564214, 0.10203605]]) (for more on tensors see 2 TensorFlow內(nèi)核基礎(chǔ));
- line 3 has to be read as a whole to appreciate the power of Python and NumPy: the call randn(y, 1) is parameterized by the variable y, which ranges over sizes[1:], so the whole expression is equivalent to [randn(15, 1), randn(10, 1)].
Line 4:
- sizes[:-1] contains the elements from the first up to, but not including, the last one (i.e. through the second-to-last), so here it is [8, 15];
- both arguments of randn on line 4 are variables, x and y, and zip pairs the two slices element by element, so the expression is equivalent to [randn(15, 8), randn(10, 15)].
Matrices and the neural network
Having dissected those four lines, we know how to define matrices efficiently, but what does that have to do with building a neural network? The figure below maps the network structure onto the matrix structures.
The network shown above is exactly what sizes = [8, 15, 10] describes: an input layer of 8 neurons, a hidden layer of 15 neurons, and an output layer of 10 neurons.
The first layer is the input layer; it has no weights or biases.
The second layer's weights form a 15x8 matrix and its biases a 15x1 column vector.
The third layer's weights form a 10x15 matrix and its biases a 10x1 column vector.
Look back at line 3: it is equivalent to [randn(15, 1), randn(10, 1)], which simply gathers the bias matrices of the two layers into a single list:
3 biases = [np.random.randn(y, 1) for y in sizes[1:]]
Look back at line 4: it is equivalent to [randn(15, 8), randn(10, 15)], which gathers the weight matrices of the two layers into a single list:
4 weights = [np.random.randn(y, x) for x, y in zip(sizes[:-1], sizes[1:])]
These four matrices are, in themselves, the neural network model we set out to build; their entries make up all of the network's learnable parameters (hyperparameters excluded). Once the mapping between the network and this family of matrices is clear, you can picture in your head how data flows through the network layer by layer until it takes its final output form.
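That layer-by-layer flow is literally a chain of matrix-vector products. Continuing from the four-line snippet above, the sketch below (with a sigmoid helper like the one defined at the end of network.py) pushes one made-up 8-dimensional input through the biases and weights and ends with a 10x1 output:

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

a = np.random.randn(8, 1)           # a made-up input for the 8-neuron input layer
for b, w in zip(biases, weights):   # (15x8)(8x1)+(15x1) -> 15x1, then (10x15)(15x1)+(10x1) -> 10x1
    a = sigmoid(np.dot(w, a) + b)
print(a.shape)                      # (10, 1)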
The stochastic gradient descent skeleton
The backbone of the whole neural network program is the gradient descent algorithm itself. In network.py it is abstracted into a single method, SGD (Stochastic Gradient Descent):
def SGD(self, training_data, epochs, mini_batch_size, eta, test_data=None)
The body of the method is very clear and consists of two nested loops. The outer loop iterates over the dataset (each pass is an epoch); the inner loop iterates over the mini-batches of stochastic gradient descent, and each batch triggers one gradient computation and one update of all the parameters (one update is one step):
for j in range(epochs):
random.shuffle(training_data)
mini_batches = [
training_data[k:k + mini_batch_size]
for k in range(0, n, mini_batch_size)]
for mini_batch in mini_batches:
self.update_mini_batch(mini_batch, eta)
BP
As you would expect, the main task inside self.update_mini_batch(mini_batch, eta) is to obtain the partial derivative for every parameter and then apply the update. The partial derivatives come from this line:
delta_nabla_b, delta_nabla_w = self.backprop(x, y)
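Once backprop has produced the per-sample gradients, update_mini_batch sums them over the mini-batch (of size m = len(mini_batch)) and takes one step against the averaged gradient, which in formula form is

w -> w - (eta / m) * Σ_x ∂C_x/∂w,    b -> b - (eta / m) * Σ_x ∂C_x/∂b

exactly the two list comprehensions at the end of update_mini_batch in the full listing below.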
The backpropagation algorithm (BP) itself is wrapped up in the backprop method:
def backprop(self, x, y):
"""Return a tuple ``(nabla_b, nabla_w)`` representing the
gradient for the cost function C_x. ``nabla_b`` and
``nabla_w`` are layer-by-layer lists of numpy arrays, similar
to ``self.biases`` and ``self.weights``."""
nabla_b = [np.zeros(b.shape) for b in self.biases]
nabla_w = [np.zeros(w.shape) for w in self.weights]
# feedforward
activation = x
activations = [x] # list to store all the activations, layer by layer
zs = [] # list to store all the z vectors, layer by layer
for b, w in zip(self.biases, self.weights):
z = np.dot(w, activation) + b
zs.append(z)
activation = sigmoid(z)
activations.append(activation)
# backward pass
delta = self.cost_derivative(activations[-1], y) * \
sigmoid_prime(zs[-1])
nabla_b[-1] = delta
nabla_w[-1] = np.dot(delta, activations[-2].transpose())
# Note that the variable l in the loop below is used a little
# differently to the notation in Chapter 2 of the book. Here,
# l = 1 means the last layer of neurons, l = 2 is the
# second-last layer, and so on. It's a renumbering of the
# scheme in the book, used here to take advantage of the fact
# that Python can use negative indices in lists.
for l in range(2, self.num_layers):
z = zs[-l]
sp = sigmoid_prime(z)
delta = np.dot(self.weights[-l + 1].transpose(), delta) * sp
nabla_b[-l] = delta
nabla_w[-l] = np.dot(delta, activations[-l - 1].transpose())
return (nabla_b, nabla_w)
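If you ever modify backprop, a common sanity check (not part of network.py) is to compare one of its analytic partial derivatives against a numerical finite-difference estimate. The sketch below assumes net is a Network instance from the full listing and (x, y) is a single pair from training_data (so y is a 10x1 one-hot vector), and uses a quadratic cost consistent with cost_derivative returning (output_activations - y):

def quad_cost(net, x, y):
    # Quadratic cost for one sample: 0.5 * ||output - y||^2.
    return 0.5 * np.sum((net.feedforward(x) - y) ** 2)

eps = 1e-5
nabla_b, nabla_w = net.backprop(x, y)       # analytic gradients from backprop
c0 = quad_cost(net, x, y)
net.biases[-1][0, 0] += eps                 # nudge one output-layer bias
c1 = quad_cost(net, x, y)
net.biases[-1][0, 0] -= eps                 # restore it
print(nabla_b[-1][0, 0], (c1 - c0) / eps)   # the two numbers should nearly match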
Recognition accuracy
To run the code, enter the following at the Python prompt:
import mnist_loader
import network
training_data, validation_data, test_data = mnist_loader.load_data_wrapper()
net = network.Network([784, 30, 10])
net.SGD(training_data, 30, 10, 3.0, test_data=test_data)
The mnist_loader module above takes care of reading the MNIST data; that code can be downloaded here (slightly adjusted so the dataset's relative path resolves).
Next, a three-layer network is defined:
- 784 input neurons (matching the 28x28 handwritten digit images);
- 30 hidden neurons;
- 10 output neurons (one per handwritten digit).
Finally, the gradient descent settings:
- epochs: 30;
- mini-batch size: 10 sample images;
- learning rate: 3.0.
Run it, and after 30 epochs of training the recognition accuracy already exceeds 95%. That figure comes without tuning any individual hyperparameter, so treat it as a baseline from which to creep toward the limits of this kind of network (above 99.6%).
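One note on the call above: test_data is passed to SGD purely so that progress can be printed after every epoch. If you go on to tune the hyperparameters, it is more consistent with the earlier validation-set discussion to monitor validation_data while tuning and touch test_data only once at the end; a hedged sketch of that variation (it relies on validation_data having the same (image, digit) format as test_data):

net = network.Network([784, 30, 10])
net.SGD(training_data, 30, 10, 3.0, test_data=validation_data)   # monitor validation accuracy per epoch
print(net.evaluate(test_data))                                    # consult the test set only once, at the end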
Appendix: the complete code
"""
network.py
~~~~~~~~~~
A module to implement the stochastic gradient descent learning
algorithm for a feedforward neural network. Gradients are calculated
using backpropagation. Note that I have focused on making the code
simple, easily readable, and easily modifiable. It is not optimized,
and omits many desirable features.
"""
# Libraries
# Standard library
import random
# Third-party libraries
import numpy as np
class Network(object):
def __init__(self, sizes):
"""The list ``sizes`` contains the number of neurons in the
respective layers of the network. For example, if the list
was [2, 3, 1] then it would be a three-layer network, with the
first layer containing 2 neurons, the second layer 3 neurons,
and the third layer 1 neuron. The biases and weights for the
network are initialized randomly, using a Gaussian
distribution with mean 0, and variance 1. Note that the first
layer is assumed to be an input layer, and by convention we
won't set any biases for those neurons, since biases are only
ever used in computing the outputs from later layers."""
self.num_layers = len(sizes)
self.sizes = sizes
self.biases = [np.random.randn(y, 1) for y in sizes[1:]]
self.weights = [np.random.randn(y, x)
for x, y in zip(sizes[:-1], sizes[1:])]
def feedforward(self, a):
"""Return the output of the network if ``a`` is input."""
for b, w in zip(self.biases, self.weights):
a = sigmoid(np.dot(w, a) + b)
return a
def SGD(self, training_data, epochs, mini_batch_size, eta,
test_data=None):
"""Train the neural network using mini-batch stochastic
gradient descent. The ``training_data`` is a list of tuples
``(x, y)`` representing the training inputs and the desired
outputs. The other non-optional parameters are
self-explanatory. If ``test_data`` is provided then the
network will be evaluated against the test data after each
epoch, and partial progress printed out. This is useful for
tracking progress, but slows things down substantially."""
if test_data:
n_test = len(test_data)
n = len(training_data)
for j in range(epochs):
random.shuffle(training_data)
mini_batches = [
training_data[k:k + mini_batch_size]
for k in range(0, n, mini_batch_size)]
for mini_batch in mini_batches:
self.update_mini_batch(mini_batch, eta)
if test_data:
print("Epoch {0}: {1} / {2}".format(
j, self.evaluate(test_data), n_test))
else:
print("Epoch {0} complete".format(j))
def update_mini_batch(self, mini_batch, eta):
"""Update the network's weights and biases by applying
gradient descent using backpropagation to a single mini batch.
The ``mini_batch`` is a list of tuples ``(x, y)``, and ``eta``
is the learning rate."""
nabla_b = [np.zeros(b.shape) for b in self.biases]
nabla_w = [np.zeros(w.shape) for w in self.weights]
for x, y in mini_batch:
delta_nabla_b, delta_nabla_w = self.backprop(x, y)
nabla_b = [nb + dnb for nb, dnb in zip(nabla_b, delta_nabla_b)]
nabla_w = [nw + dnw for nw, dnw in zip(nabla_w, delta_nabla_w)]
self.weights = [w - (eta / len(mini_batch)) * nw for w, nw in zip(
self.weights, nabla_w)]
self.biases = [b - (eta / len(mini_batch)) * nb for b, nb in zip(
self.biases, nabla_b)]
def backprop(self, x, y):
"""Return a tuple ``(nabla_b, nabla_w)`` representing the
gradient for the cost function C_x. ``nabla_b`` and
``nabla_w`` are layer-by-layer lists of numpy arrays, similar
to ``self.biases`` and ``self.weights``."""
nabla_b = [np.zeros(b.shape) for b in self.biases]
nabla_w = [np.zeros(w.shape) for w in self.weights]
# feedforward
activation = x
activations = [x] # list to store all the activations, layer by layer
zs = [] # list to store all the z vectors, layer by layer
for b, w in zip(self.biases, self.weights):
z = np.dot(w, activation) + b
zs.append(z)
activation = sigmoid(z)
activations.append(activation)
# backward pass
delta = self.cost_derivative(activations[-1], y) * \
sigmoid_prime(zs[-1])
nabla_b[-1] = delta
nabla_w[-1] = np.dot(delta, activations[-2].transpose())
# Note that the variable l in the loop below is used a little
# differently to the notation in Chapter 2 of the book. Here,
# l = 1 means the last layer of neurons, l = 2 is the
# second-last layer, and so on. It's a renumbering of the
# scheme in the book, used here to take advantage of the fact
# that Python can use negative indices in lists.
for l in range(2, self.num_layers):
z = zs[-l]
sp = sigmoid_prime(z)
delta = np.dot(self.weights[-l + 1].transpose(), delta) * sp
nabla_b[-l] = delta
nabla_w[-l] = np.dot(delta, activations[-l - 1].transpose())
return (nabla_b, nabla_w)
def evaluate(self, test_data):
"""Return the number of test inputs for which the neural
network outputs the correct result. Note that the neural
network's output is assumed to be the index of whichever
neuron in the final layer has the highest activation."""
test_results = [(np.argmax(self.feedforward(x)), y)
for (x, y) in test_data]
return sum(int(x == y) for (x, y) in test_results)
def cost_derivative(self, output_activations, y):
"""Return the vector of partial derivatives \partial C_x /
\partial a for the output activations."""
return (output_activations - y)
# Miscellaneous functions
def sigmoid(z):
"""The sigmoid function."""
return 1.0 / (1.0 + np.exp(-z))
def sigmoid_prime(z):
"""Derivative of the sigmoid function."""
return sigmoid(z) * (1 - sigmoid(z))
License: Attribution-NonCommercial-NoDerivatives 3.0 China (CC BY-NC-ND 3.0 CN)
When reposting, please credit the author: 黑猿大叔 (Jianshu)