From an optimization standpoint there are two broad families of methods, first-order and second-order. Starting with the simplest:
- Gradient Descent
while True:
    weights_grad = evaluate_gradient(loss_fun, data, weights)  # gradient over the full dataset
    weights += -step_size * weights_grad                       # parameter update
Drawbacks:
- Every update needs a gradient over the entire dataset, which is expensive
- For non-convex functions there is no guarantee of reaching the global optimum
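A minimal runnable sketch of this loop, assuming an illustrative quadratic loss L(w) = 0.5 * ||A w - b||^2 (the matrix `A`, vector `b`, step size, and iteration budget are all made-up demo values):

import numpy as np

A = np.array([[3.0, 0.5], [0.5, 1.0]])   # toy quadratic loss: L(w) = 0.5 * ||A w - b||^2
b = np.array([1.0, -2.0])

def evaluate_gradient(w):
    return A.T @ (A @ w - b)             # gradient of the quadratic loss

weights = np.zeros(2)
step_size = 0.1
for _ in range(200):                     # fixed budget instead of `while True`
    weights_grad = evaluate_gradient(weights)
    weights += -step_size * weights_grad

print(weights)                           # approaches the exact minimizer below
print(np.linalg.solve(A, b))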
- SGD (Stochastic Gradient Descent)
while True:
    data_batch = sample_training_data(data, 256)                     # sample a mini-batch of 256 examples
    weights_grad = evaluate_gradient(loss_fun, data_batch, weights)  # gradient on the mini-batch only
    weights += -step_size * weights_grad
Note that the code above is mini-batch SGD; there is also a per-example variant (batch_size = 1), which oscillates noticeably during optimization.
Drawbacks:
- The high variance of the gradient estimate can slow convergence and hurt stability
- A suitable learning rate has to be chosen
- The non-convexity problem is still not addressed
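A self-contained sketch of the mini-batch loop on synthetic linear-regression data (the data generator, the batch size of 256, and the hyperparameters are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(0)
true_w = np.array([2.0, -3.0])                      # synthetic ground-truth weights
X = rng.normal(size=(10_000, 2))
y = X @ true_w + 0.1 * rng.normal(size=10_000)      # noisy targets

def sample_training_data(X, y, batch_size):
    idx = rng.integers(0, len(X), size=batch_size)  # random mini-batch indices
    return X[idx], y[idx]

def evaluate_gradient(Xb, yb, w):
    return Xb.T @ (Xb @ w - yb) / len(Xb)           # gradient of the mean squared error

weights = np.zeros(2)
step_size = 0.05
for _ in range(2000):
    Xb, yb = sample_training_data(X, y, 256)
    weights_grad = evaluate_gradient(Xb, yb, weights)
    weights += -step_size * weights_grad

print(weights)                                      # lands near true_w = [2, -3]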
- SGD+Momentum
vx = 0
while True:
    dx = compute_gradient(x)
    # rho is typically set to 0.9 or 0.99
    vx = rho * vx + dx            # velocity: a decaying running sum of past gradients
    x += -learning_rate * vx
Advantages:
- Damps the oscillations
- Can carry the iterate past local minima and saddle points
Drawback:
- With a large momentum term it may overshoot the minimum
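To see the damping effect, here is a small comparison against the plain gradient step on an ill-conditioned toy quadratic (the loss, rho = 0.9, learning rate, and step count are illustrative choices):

import numpy as np

def compute_gradient(x):
    return np.array([10.0, 1.0]) * x           # loss = 0.5 * (10 * x0^2 + x1^2)

def loss(x):
    return 0.5 * (10.0 * x[0] ** 2 + x[1] ** 2)

def run(use_momentum, steps=100, learning_rate=0.02, rho=0.9):
    x = np.array([1.0, 1.0])
    vx = np.zeros_like(x)
    for _ in range(steps):
        dx = compute_gradient(x)
        if use_momentum:
            vx = rho * vx + dx                  # accumulate velocity
            x = x - learning_rate * vx
        else:
            x = x - learning_rate * dx          # plain gradient step
    return loss(x)

print("plain:   ", run(False))                  # slow progress along the shallow x1 direction
print("momentum:", run(True))                   # much lower loss in the same number of steps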
- Nesterov
v = 0
while True:
    dx = compute_gradient(x)
    old_v = v
    v = rho * v - learning_rate * dx
    x += -rho * old_v + (1 + rho) * v
The original update rule is
$v_{t+1} = \rho v_t - \alpha \nabla f(x_t + \rho v_t), \qquad x_{t+1} = x_t + v_{t+1}$
However, the gradient inside the parentheses (taken at the look-ahead point $x_t + \rho v_t$) is inconvenient to compute directly, so a change of variables simplifies it to the form above.
The idea is this: to counter the momentum method's tendency to overshoot the minimum, we "look ahead" a little and evaluate the gradient at the point the velocity is about to carry us to.
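For comparison, here is the look-ahead form written directly, on the same kind of toy quadratic (the loss and hyperparameters are illustrative):

import numpy as np

def compute_gradient(x):
    return np.array([10.0, 1.0]) * x            # loss = 0.5 * (10 * x0^2 + x1^2)

x = np.array([1.0, 1.0])
v = np.zeros_like(x)
rho, learning_rate = 0.9, 0.02

for _ in range(100):
    dx_ahead = compute_gradient(x + rho * v)    # gradient at the look-ahead point
    v = rho * v - learning_rate * dx_ahead
    x = x + v

print(x)                                        # close to the optimum at (0, 0)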
- AdaGrad
grad_squared = 0
while True:
    dx = compute_gradient(x)
    grad_squared += dx * dx                                    # per-parameter sum of squared gradients
    x -= learning_rate * dx / (np.sqrt(grad_squared) + 1e-7)   # large accumulated gradients get smaller steps
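AdaGrad scales each parameter's step by the inverse square root of its accumulated squared gradients, so parameters with large gradients take smaller effective steps. A small demonstration on a toy quadratic with a 100x curvature gap (the loss and learning rate are illustrative):

import numpy as np

def compute_gradient(x):
    return np.array([100.0, 1.0]) * x    # loss = 0.5 * (100 * x0^2 + x1^2)

x = np.array([1.0, 1.0])
grad_squared = np.zeros_like(x)
learning_rate = 0.5

for _ in range(200):
    dx = compute_gradient(x)
    grad_squared += dx * dx
    x -= learning_rate * dx / (np.sqrt(grad_squared) + 1e-7)

print(x)                        # both coordinates shrink at the same rate despite the 100x gap
print(np.sqrt(grad_squared))    # much larger accumulator for the steep coordinate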
- RMSprop
grad_squared = 0
while True:
    dx = compute_gradient(x)
    grad_squared = decay_rate * grad_squared + (1 - decay_rate) * dx * dx   # leaky average instead of a running sum
    x -= learning_rate * dx / (np.sqrt(grad_squared) + 1e-7)
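RMSprop keeps AdaGrad's per-parameter scaling but replaces the ever-growing sum of squared gradients with an exponential moving average, so the effective step size no longer decays toward zero over a long run. A short sketch on the same toy problem (decay_rate = 0.99 and the other settings are illustrative):

import numpy as np

def compute_gradient(x):
    return np.array([100.0, 1.0]) * x    # loss = 0.5 * (100 * x0^2 + x1^2)

x = np.array([1.0, 1.0])
grad_squared = np.zeros_like(x)
learning_rate, decay_rate = 0.01, 0.99

for _ in range(500):
    dx = compute_gradient(x)
    grad_squared = decay_rate * grad_squared + (1 - decay_rate) * dx * dx   # leaky average
    x -= learning_rate * dx / (np.sqrt(grad_squared) + 1e-7)

print(x)     # near the optimum; unlike AdaGrad, the effective step does not keep shrinking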
- Adam
first_moment = 0
second_moment = 0
for t in range(1, num_iterations + 1):
    dx = compute_gradient(x)
    first_moment = beta1 * first_moment + (1 - beta1) * dx          # momentum-style first-moment estimate
    second_moment = beta2 * second_moment + (1 - beta2) * dx * dx   # RMSprop-style second-moment estimate
    # Bias correction: because first_moment and second_moment start accumulating from 0,
    # they are biased toward 0 for small t; dividing by (1 - beta ** t) corrects this.
    first_unbias = first_moment / (1 - beta1 ** t)
    second_unbias = second_moment / (1 - beta2 ** t)
    x -= learning_rate * first_unbias / (np.sqrt(second_unbias) + 1e-7)
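Finally, a self-contained sketch of Adam on the same toy quadratic (learning rate, beta1 = 0.9, beta2 = 0.999, and the iteration count are illustrative defaults rather than values from the notes):

import numpy as np

def compute_gradient(x):
    return np.array([10.0, 1.0]) * x    # loss = 0.5 * (10 * x0^2 + x1^2)

x = np.array([1.0, 1.0])
first_moment = np.zeros_like(x)
second_moment = np.zeros_like(x)
learning_rate, beta1, beta2 = 0.05, 0.9, 0.999
num_iterations = 300

for t in range(1, num_iterations + 1):
    dx = compute_gradient(x)
    first_moment = beta1 * first_moment + (1 - beta1) * dx
    second_moment = beta2 * second_moment + (1 - beta2) * dx * dx
    first_unbias = first_moment / (1 - beta1 ** t)      # undo the bias toward 0 at small t
    second_unbias = second_moment / (1 - beta2 ** t)
    x -= learning_rate * first_unbias / (np.sqrt(second_unbias) + 1e-7)

print(x)     # heads toward the optimum at (0, 0)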