Overview
gradient descent
momentum
RMSProp
adam
Saddle points
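Gradient descent comes in three variants, depending on how many samples are used to compute the gradient at each iteration: batch, mini-batch, and SGD. gd can be improved by changing either the update step or the alpha value (the learning rate). Algorithms that optimize the update step include momentum, RMSProp, and adam. Changing alpha means learning rate decay, which can be done in several ways; a few common ones are listed below.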
alpha = alpha_0 / (1 + decay_rate * t);
alpha = alpha_0 * 0.95 ** t;
alpha = alpha_0 * k / sqrt(t); ...
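The three schedules above can be written as small functions to compare how fast each one shrinks alpha. This is only a sketch: the function names and the sample values of alpha_0, decay_rate, and k are my own choices, not from the original.

```python
import math

def inverse_time_decay(alpha_0, decay_rate, t):
    # alpha = alpha_0 / (1 + decay_rate * t)
    return alpha_0 / (1 + decay_rate * t)

def exponential_decay(alpha_0, t, base=0.95):
    # alpha = alpha_0 * 0.95 ** t
    return alpha_0 * base ** t

def inverse_sqrt_decay(alpha_0, k, t):
    # alpha = alpha_0 * k / sqrt(t); only defined for t >= 1
    return alpha_0 * k / math.sqrt(t)

# Illustrative values (assumptions): alpha_0 = 0.1, decay_rate = 1.0, k = 1.0
for t in (1, 4, 9, 16):
    print(t,
          inverse_time_decay(0.1, 1.0, t),
          exponential_decay(0.1, t),
          inverse_sqrt_decay(0.1, 1.0, t))
```

All three decrease monotonically in t; which one suits a given model is an empirical question.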
This article only covers standard gd and the algorithms that optimize its update step. Assume the function to optimize is z = x ** 2 + 5 * y ** 2:
import matplotlib.pyplot as plt
import numpy as np
# Elliptical contour lines
x = y = np.linspace(-10, 10, 1000)
x,y = np.meshgrid(x,y)
z = x ** 2 + 5 * y ** 2
plt.contour(x,y,z)
plt.show()
gd algorithm: x := x - alpha * dx, y := y - alpha * dy
where dx = 2x and dy = 10y; take the starting point (x, y) = (9, 8)
plt.contour(x, y, z)
plt.scatter(0, 0)
plt.xlabel("x")
plt.ylabel("y")
alpha = 0.01
# The gradient (2x, 10y) is w @ t with w = diag(2, 10)
w = np.array([2, 0, 0, 10]).reshape(2, 2)
# Starting point (x, y) = (9, 8) as a column vector
t = np.array([9, 8]).reshape(2, 1)
plt.scatter(t[0], t[1])
for i in range(1, 100):
    dt = np.dot(w, t)
    t_step = dt
    t = t - alpha * t_step
    plt.scatter(t[0], t[1])
print("Final value")
print(t[0], t[1])
plt.show()
Final value
[1.2179347] [0.0002361]
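Because the gradient is linear in x and y, each coordinate shrinks by a constant factor per step, so the final value can be cross-checked in closed form. A small sketch (the variable names are my own):

```python
alpha = 0.01
steps = 99              # range(1, 100) performs 99 updates
x0, y0 = 9.0, 8.0

# x := x - alpha * 2x   is the same as  x * (1 - 2 * alpha)
# y := y - alpha * 10y  is the same as  y * (1 - 10 * alpha)
x_final = x0 * (1 - 2 * alpha) ** steps
y_final = y0 * (1 - 10 * alpha) ** steps
print(x_final, y_final)
```

With alpha = 0.01 the x factor is 0.98, which is why x is still far from 0 after 99 steps while y has almost vanished.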
Setting the alpha value
def gd(alpha):
    w = np.array([2, 0, 0, 10]).reshape(2, 2)
    t = np.array([9, 8]).reshape(2, 1)
    i = 1
    iter_array = []
    result_array = []
    # Record the objective at the starting point as iteration 0
    iter_array.append(0)
    result = t[0] ** 2 + 5 * t[1] ** 2
    result_array.append(result)
    while i < 100:
        dt = np.dot(w, t)
        t_step = dt
        t = t - alpha * t_step
        iter_array.append(i)
        result = t[0] ** 2 + 5 * t[1] ** 2
        result_array.append(result)
        i += 1
    return iter_array, result_array

def get_convergence_iter(i_result_array, threshold):
    # 1-based position of the first recorded value below threshold;
    # len + 1 (here 101) means the threshold was never reached
    i = 0
    while i < len(i_result_array):
        t = i_result_array[i]
        if abs(t) < threshold:
            return i + 1
        i += 1
    return i + 1

for i in range(1, 20):
    i_new = i / 100
    i_iter_array, i_result_array = gd(i_new)
    print('alpha: %.2f, iterations to converge: %d' % (i_new, get_convergence_iter(i_result_array, 0.0001)))
for i in [0.01, 0.1, 0.2005]:
    i_iter_array, i_result_array = gd(i)
    plt.plot(i_iter_array, i_result_array, label="alpha:%.4f" % (i))
plt.legend()
plt.show()
alpha: 0.01, iterations to converge: 101
alpha: 0.02, iterations to converge: 101
alpha: 0.03, iterations to converge: 101
alpha: 0.04, iterations to converge: 83
alpha: 0.05, iterations to converge: 66
alpha: 0.06, iterations to converge: 55
alpha: 0.07, iterations to converge: 47
alpha: 0.08, iterations to converge: 41
alpha: 0.09, iterations to converge: 36
alpha: 0.10, iterations to converge: 32
alpha: 0.11, iterations to converge: 29
alpha: 0.12, iterations to converge: 26
alpha: 0.13, iterations to converge: 24
alpha: 0.14, iterations to converge: 22
alpha: 0.15, iterations to converge: 21
alpha: 0.16, iterations to converge: 19
alpha: 0.17, iterations to converge: 23
alpha: 0.18, iterations to converge: 35
alpha: 0.19, iterations to converge: 73
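The results above show that as alpha increases, the number of iterations first falls, then rises, and finally the algorithm diverges (alpha = 0.2005 in the plot). alpha has to be chosen sensibly, otherwise the optimizer will not converge.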
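This pattern follows from the update rule: on this objective each gd step multiplies x by (1 - 2 * alpha) and y by (1 - 10 * alpha), so the iteration converges only when both factors have magnitude below 1, that is, when alpha < 0.2. A minimal sketch (the helper name gd_factors is my own, not from the original code):

```python
def gd_factors(alpha):
    """Per-step multipliers of x and y for z = x**2 + 5*y**2."""
    return 1 - 2 * alpha, 1 - 10 * alpha

for alpha in (0.01, 0.16, 0.19, 0.2005):
    fx, fy = gd_factors(alpha)
    # The coordinate with the largest |factor| shrinks slowest and sets the pace
    rate = max(abs(fx), abs(fy))
    status = "converges" if rate < 1 else "diverges"
    print(alpha, round(fx, 4), round(fy, 4), status)
```

The two magnitudes are equal at alpha = 1/6 (about 0.167), which is consistent with the fastest entry in the table being alpha = 0.16. For small alpha the x factor is the bottleneck; for alpha near 0.2 the y factor approaches -1 and the iterate oscillates across the x axis.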
Optimizing gd
momentum
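The momentum algorithm applies an exponential moving average to the gradient.
gd algorithm: x := x - alpha * dx;
momentum algorithm:
vdx = beta * vdx + (1-beta) * dx,
x := x - alpha * vdx
When a machine learning or neural network model is trained with SGD or mini-batch, dw (the gradient with respect to the weights) depends on the batch drawn at each iteration, which makes the step oscillate. A moving average damps these oscillations and so speeds up convergence.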
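The damping effect can be seen on a synthetic gradient sequence. The sketch below is an illustration under assumptions of my own: the "mini-batch gradients" are simulated as a true gradient of 1.0 plus Gaussian noise, which is not data from the original.

```python
import numpy as np

rng = np.random.default_rng(0)
beta = 0.9

# Hypothetical mini-batch gradients: true gradient 1.0 plus batch noise
noisy_dx = 1.0 + rng.normal(0.0, 1.0, size=1000)

vdx = 0.0
smoothed = []
for dx in noisy_dx:
    # Same update as the momentum formula above
    vdx = beta * vdx + (1 - beta) * dx
    smoothed.append(vdx)
smoothed = np.array(smoothed)

# Skip the first 100 steps so the warm-up from vdx = 0 does not distort the comparison
print("std of raw gradients:     ", noisy_dx[100:].std())
print("std of averaged gradients:", smoothed[100:].std())
```

The averaged sequence keeps roughly the same mean but fluctuates much less, which is the effect the paragraph above describes.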
plt.contour(x, y, z)
plt.scatter(0, 0)
plt.xlabel("x")
plt.ylabel("y")
alpha = 0.2005
# gd: with this alpha the y coordinate overshoots and the steps grow
w = np.array([2, 0, 0, 10]).reshape(2, 2)
t = np.array([9, 8]).reshape(2, 1)
dy_value = []
for i in range(1, 100):
    dt = np.dot(w, t)
    t_step = dt
    # Record the y component of the step (before scaling by alpha)
    dy_value.append(round(t_step[1][0], 2))
    t = t - alpha * t_step
print("gd update steps (y component)")
print(dy_value)
# momentum: same alpha, but the step is the moving average of the gradient
beta = 0.9
w = np.array([2, 0, 0, 10]).reshape(2, 2)
t = np.array([9, 8]).reshape(2, 1)
plt.scatter(t[0], t[1])
vdt = 0
dy_value = []
for i in range(1, 100):
    dt = np.dot(w, t)
    vdt = beta * vdt + (1 - beta) * dt
    t_step = vdt
    dy_value.append(round(t_step[1][0], 2))
    t = t - alpha * t_step
    plt.scatter(t[0], t[1])
print("momentum update steps (y component)")
print(dy_value)
print("Final value")
print(t[0], t[1])
plt.show()
gd update steps (y component)
[80, -80.4, 80.8, -81.21, 81.61, -82.02, 82.43, -82.84, 83.26, -83.67, 84.09, -84.51, 84.93, -85.36, 85.79, -86.21, 86.65, -87.08, 87.51, -87.95, 88.39, -88.83, 89.28, -89.72, 90.17, -90.62, 91.08, -91.53, 91.99, -92.45, 92.91, -93.38, 93.84, -94.31, 94.78, -95.26, 95.73, -96.21, 96.69, -97.18, 97.66, -98.15, 98.64, -99.14, 99.63, -100.13, 100.63, -101.13, 101.64, -102.15, 102.66, -103.17, 103.69, -104.21, 104.73, -105.25, 105.78, -106.31, 106.84, -107.37, 107.91, -108.45, 108.99, -109.53, 110.08, -110.63, 111.19, -111.74, 112.3, -112.86, 113.43, -113.99, 114.56, -115.14, 115.71, -116.29, 116.87, -117.46, 118.04, -118.63, 119.23, -119.82, 120.42, -121.02, 121.63, -122.24, 122.85, -123.46, 124.08, -124.7, 125.32, -125.95, 126.58, -127.21, 127.85, -128.49, 129.13, -129.78, 130.43]
momentum update steps (y component)
[8.0, 13.6, 15.91, 14.8, 10.83, 5.09, -1.1, -6.45, -9.97, -11.14, -9.96, -6.9, -2.76, 1.51, 5.06, 7.23, 7.74, 6.65, 4.33, 1.38, -1.56, -3.89, -5.2, -5.35, -4.4, -2.67, -0.57, 1.43, 2.94, 3.71, 3.66, 2.89, 1.61, 0.13, -1.22, -2.19, -2.63, -2.49, -1.87, -0.94, 0.09, 1.0, 1.62, 1.85, 1.69, 1.2, 0.52, -0.19, -0.79, -1.18, -1.29, -1.13, -0.76, -0.27, 0.22, 0.62, 0.85, 0.89, 0.75, 0.47, 0.13, -0.21, -0.47, -0.61, -0.61, -0.5, -0.29, -0.04, 0.18, 0.35, 0.43, 0.42, 0.32, 0.17, 0.0, -0.15, -0.26, -0.31, -0.28, -0.21, -0.1, 0.02, 0.12, 0.19, 0.21, 0.19, 0.13, 0.05, -0.03, -0.1, -0.14, -0.15, -0.13, -0.08, -0.03, 0.03, 0.07, 0.1, 0.1]
Final value
[0.03790624] [-0.00786362]
Efficiency of the momentum algorithm
def mom(alpha, beta):
    w = np.array([2, 0, 0, 10]).reshape(2, 2)
    t = np.array([9, 8]).reshape(2, 1)
    vdt = 0
    i = 1
    iter_array = []
    result_array = []
    # Record the objective at the starting point as iteration 0
    iter_array.append(0)
    result = t[0] ** 2 + 5 * t[1] ** 2
    result_array.append(result)
    while i < 100:
        dt = np.dot(w, t)
        vdt = beta * vdt + (1 - beta) * dt
        t_step = vdt
        t = t - alpha * t_step
        iter_array.append(i)
        result = t[0] ** 2 + 5 * t[1] ** 2
        result_array.append(result)
        i += 1
    return iter_array, result_array

beta = 0.9
for i in range(1, 20):
    i_new = i / 10
    i_iter_array, i_result_array = mom(i_new, beta)
    # Note: the threshold here is 0.001, looser than the 0.0001 used for gd
    print('alpha: %.2f, iterations to converge: %d' % (i_new, get_convergence_iter(i_result_array, 0.001)))
for i in [0.01, 0.1, 1]:
    i_iter_array, i_result_array = mom(i, beta)
    plt.plot(i_iter_array, i_result_array, label="alpha:%.2f" % (i))
plt.legend()
plt.show()
alpha: 0.10, iterations to converge: 84
alpha: 0.20, iterations to converge: 101
alpha: 0.30, iterations to converge: 98
alpha: 0.40, iterations to converge: 84
alpha: 0.50, iterations to converge: 101
alpha: 0.60, iterations to converge: 95
alpha: 0.70, iterations to converge: 101
alpha: 0.80, iterations to converge: 101
alpha: 0.90, iterations to converge: 98
alpha: 1.00, iterations to converge: 101
alpha: 1.10, iterations to converge: 96
alpha: 1.20, iterations to converge: 101
alpha: 1.30, iterations to converge: 100
alpha: 1.40, iterations to converge: 101
alpha: 1.50, iterations to converge: 92
alpha: 1.60, iterations to converge: 84
alpha: 1.70, iterations to converge: 96
alpha: 1.80, iterations to converge: 101
alpha: 1.90, iterations to converge: 101
A physical way to read the momentum algorithm, vdx = beta * vdx + (1-beta) * dx: dx acts as the acceleration, beta (which is less than 1) acts as friction, and vdx acts as the momentum.
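One way to see the friction reading is to feed the update a constant gradient: vdx does not grow without bound but levels off at the gradient value, much like a terminal velocity. A minimal sketch, where the constant gradient dx = 1.0 is a hypothetical input chosen for illustration:

```python
beta = 0.9
dx = 1.0        # constant "acceleration" (assumed for illustration)
vdx = 0.0
history = []
for n in range(1, 51):
    vdx = beta * vdx + (1 - beta) * dx
    history.append(vdx)

# After n steps vdx equals dx * (1 - beta ** n), so it approaches dx from below
print(history[0], history[9], history[49])
```

A larger beta means less friction: vdx responds more slowly, and roughly the last 1 / (1 - beta) gradients dominate the average.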
RMSProp
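The alpha value in gd is limited by the y axis, where the slope is steepest. rmsprop keeps an exponentially weighted average of the squared gradient and divides the step by its square root, which cancels the effect of the dimension with the large gradient and allows a larger alpha.
gd algorithm: x := x - alpha * dx;
rmsprop algorithm:
sdx = beta * sdx + (1-beta) * dx ** 2,
x := x - alpha * dx / sqrt(sdx)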
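The normalization is easiest to see on the very first update from sdx = 0. The sketch below evaluates the formulas above at the starting point (9, 8) used throughout this article; epsilon is the usual small constant added to avoid division by zero.

```python
import numpy as np

beta = 0.9
epsilon = 1e-8
dt = np.array([18.0, 80.0])     # gradient (2x, 10y) at (x, y) = (9, 8)

sdt = (1 - beta) * dt ** 2      # first update, starting from sdt = 0
t_step = dt / (np.sqrt(sdt) + epsilon)
print(t_step)
```

Although the raw gradients differ by more than a factor of four, both components of t_step come out as 1 / sqrt(1 - beta), about 3.16. The step size no longer depends on how steep each axis is, which is why alpha is not capped by the y direction.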
plt.contour(x, y, z)
plt.scatter(0, 0)
plt.xlabel('x')
plt.ylabel('y')
# rmsprop
alpha = 0.2005
epsilon = 0.00000001   # avoids division by zero
beta = 0.9
w = np.array([2, 0, 0, 10]).reshape(2, 2)
t = np.array([9, 8]).reshape(2, 1)
plt.scatter(t[0], t[1])
sdt = 0
for i in range(1, 100):
    dt = np.dot(w, t)
    # Moving average of the squared gradient, per coordinate
    sdt = beta * sdt + (1 - beta) * dt ** 2
    t_step = dt / (np.sqrt(sdt) + epsilon)
    t = t - alpha * t_step
    plt.scatter(t[0], t[1])
print("Final value")
print(t[0], t[1])
plt.show()
Final value
[-1.6552961e-19] [4.0731588e-20]
Efficiency of the RMSProp algorithm
def rmsprop(alpha, beta, epsilon=1e-8):
    # epsilon is passed in explicitly instead of relying on a global
    w = np.array([2, 0, 0, 10]).reshape(2, 2)
    t = np.array([9, 8]).reshape(2, 1)
    sdt = 0
    i = 1
    iter_array = []
    result_array = []
    # Record the objective at the starting point as iteration 0
    iter_array.append(0)
    result = t[0] ** 2 + 5 * t[1] ** 2
    result_array.append(result)
    while i < 100:
        dt = np.dot(w, t)
        sdt = beta * sdt + (1 - beta) * dt ** 2
        t_step = dt / (np.sqrt(sdt) + epsilon)
        t = t - alpha * t_step
        iter_array.append(i)
        result = t[0] ** 2 + 5 * t[1] ** 2
        result_array.append(result)
        i += 1
    return iter_array, result_array

beta = 0.9
for i in range(1, 20):
    i_new = i / 10
    i_iter_array, i_result_array = rmsprop(i_new, beta)
    print('alpha: %.2f, iterations to converge: %d' % (i_new, get_convergence_iter(i_result_array, 0.0001)))
for i in [0.01, 0.1, 1, 2]:
    i_iter_array, i_result_array = rmsprop(i, beta)
    plt.plot(i_iter_array, i_result_array, label="alpha:%.2f" % (i))
plt.legend()
plt.show()
alpha: 0.10, iterations to converge: 101
alpha: 0.20, iterations to converge: 70
alpha: 0.30, iterations to converge: 51
alpha: 0.40, iterations to converge: 40
alpha: 0.50, iterations to converge: 33
alpha: 0.60, iterations to converge: 27
alpha: 0.70, iterations to converge: 23
alpha: 0.80, iterations to converge: 20
alpha: 0.90, iterations to converge: 18
alpha: 1.00, iterations to converge: 16
alpha: 1.10, iterations to converge: 14
alpha: 1.20, iterations to converge: 13
alpha: 1.30, iterations to converge: 12
alpha: 1.40, iterations to converge: 11
alpha: 1.50, iterations to converge: 10
alpha: 1.60, iterations to converge: 9
alpha: 1.70, iterations to converge: 8
alpha: 1.80, iterations to converge: 7
alpha: 1.90, iterations to converge: 7
adam
The adam algorithm combines momentum and rmsprop (t is the iteration number, starting at 1):
vdx = beta1 * vdx + (1-beta1) * dx,
vdxc = vdx / (1 - beta1 ** t),
sdx = beta2 * sdx + (1 - beta2) * dx ** 2,
sdxc = sdx / (1 - beta2 ** t),
x := x - alpha * vdxc / sqrt(sdxc)
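The role of the two corrections, vdxc and sdxc, shows up on the first iteration, when vdx and sdx have only just left zero. The sketch below evaluates the formulas above at t = 1 for the gradient at the article's starting point (9, 8); epsilon is the usual small constant guarding against division by zero.

```python
import numpy as np

beta1, beta2, epsilon = 0.9, 0.999, 1e-8
dt = np.array([18.0, 80.0])     # gradient (2x, 10y) at (x, y) = (9, 8)
t = 1                           # first iteration

vdt = (1 - beta1) * dt          # starting from vdt = 0
sdt = (1 - beta2) * dt ** 2     # starting from sdt = 0

# Without bias correction the ratio is inflated by (1 - beta1) / sqrt(1 - beta2)
uncorrected = vdt / (np.sqrt(sdt) + epsilon)

vdtc = vdt / (1 - beta1 ** t)
sdtc = sdt / (1 - beta2 ** t)
corrected = vdtc / (np.sqrt(sdtc) + epsilon)

print(uncorrected, corrected)
```

With the correction, the first step has magnitude close to 1 in every coordinate, so the first move is about alpha along each axis regardless of the gradient scale. Without it, the first step would be about 3.16 times larger for these beta values.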
plt.contour(x, y, z)
plt.scatter(0, 0)
plt.xlabel('x')
plt.ylabel('y')
# adam
alpha = 0.2005
epsilon = 0.00000001   # avoids division by zero
beta1 = 0.9
beta2 = 0.999
w = np.array([2, 0, 0, 10]).reshape(2, 2)
t = np.array([9, 8]).reshape(2, 1)
plt.scatter(t[0], t[1])
vdt = 0
sdt = 0
for i in range(1, 100):
    dt = np.dot(w, t)
    # First moment (momentum term) and its bias correction
    vdt = beta1 * vdt + (1 - beta1) * dt
    vdtc = vdt / (1 - beta1 ** i)
    # Second moment (rmsprop term) and its bias correction
    sdt = beta2 * sdt + (1 - beta2) * dt ** 2
    sdtc = sdt / (1 - beta2 ** i)
    t_step = vdtc / (np.sqrt(sdtc) + epsilon)
    t = t - alpha * t_step
    plt.scatter(t[0], t[1])
print("Final value")
print(t[0], t[1])
plt.show()
Final value
[-0.09618689] [-0.04753836]
Efficiency of the adam algorithm
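def adam(alpha, beta1, beta2, epsilon=1e-8):
    w = np.array([2, 0, 0, 10]).reshape(2, 2)
    t = np.array([9, 8]).reshape(2, 1)
    vdt = 0
    sdt = 0
    i = 1
    iter_array = []
    result_array = []
    # Record the objective at the starting point as iteration 0
    iter_array.append(0)
    result = t[0] ** 2 + 5 * t[1] ** 2
    result_array.append(result)
    while i < 100:
        dt = np.dot(w, t)
        vdt = beta1 * vdt + (1 - beta1) * dt
        vdtc = vdt / (1 - beta1 ** i)
        # Second moment uses beta2, not the global beta left over from rmsprop
        sdt = beta2 * sdt + (1 - beta2) * dt ** 2
        sdtc = sdt / (1 - beta2 ** i)
        # Step with the bias-corrected moments, matching the adam formulas
        t_step = vdtc / (np.sqrt(sdtc) + epsilon)
        t = t - alpha * t_step
        iter_array.append(i)
        result = t[0] ** 2 + 5 * t[1] ** 2
        result_array.append(result)
        i += 1
    return iter_array, result_array

beta1 = 0.9
beta2 = 0.999
for i in range(1, 20):
    i_new = i / 10
    i_iter_array, i_result_array = adam(i_new, beta1, beta2)
    print('alpha: %.2f, iterations to converge: %d' % (i_new, get_convergence_iter(i_result_array, 0.0001)))
for i in [0.01, 0.1, 1, 2]:
    i_iter_array, i_result_array = adam(i, beta1, beta2)
    plt.plot(i_iter_array, i_result_array, label="alpha:%.2f" % (i))
plt.legend()
plt.show()
The listing below is the output as originally published. It is identical to the RMSProp table because the published function updated sdt with beta (0.9, left over from the RMSProp section) and stepped with dt / sqrt(sdt), which is exactly the RMSProp update. The function above uses beta2 and vdtc / sqrt(sdtc) instead, so running it will print different counts.
alpha: 0.10, iterations to converge: 101
alpha: 0.20, iterations to converge: 70
alpha: 0.30, iterations to converge: 51
alpha: 0.40, iterations to converge: 40
alpha: 0.50, iterations to converge: 33
alpha: 0.60, iterations to converge: 27
alpha: 0.70, iterations to converge: 23
alpha: 0.80, iterations to converge: 20
alpha: 0.90, iterations to converge: 18
alpha: 1.00, iterations to converge: 16
alpha: 1.10, iterations to converge: 14
alpha: 1.20, iterations to converge: 13
alpha: 1.30, iterations to converge: 12
alpha: 1.40, iterations to converge: 11
alpha: 1.50, iterations to converge: 10
alpha: 1.60, iterations to converge: 9
alpha: 1.70, iterations to converge: 8
alpha: 1.80, iterations to converge: 7
alpha: 1.90, iterations to converge: 7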
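Local minima and saddle points
In high dimensions an algorithm is more likely to run into a saddle point than a local minimum. A local minimum requires the function to curve upward (be convex) along every dimension at that point, and in high dimensions the probability of that happening is very small.
The effect of a saddle point on the algorithm is that iteration slows down in the flat region around it; algorithms such as adam can be used to speed the optimization up.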
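Whether a critical point is a minimum or a saddle can be read from the signs of the Hessian eigenvalues. The sketch below does this for z = x ** 2 - y ** 2, the surface plotted in this section; the helper classify is my own naming. The last loop rests on a crude assumption that is not from the original: if each eigenvalue were independently positive with probability 1/2, the chance that all n are positive would be 0.5 ** n.

```python
import numpy as np

def classify(eigenvalues):
    """Classify a critical point from the eigenvalues of its Hessian."""
    if np.all(eigenvalues > 0):
        return "local minimum"
    if np.all(eigenvalues < 0):
        return "local maximum"
    return "saddle point"

# Hessian of z = x**2 - y**2 (constant everywhere)
hessian = np.array([[2.0, 0.0],
                    [0.0, -2.0]])
eigenvalues = np.linalg.eigvalsh(hessian)   # returned in ascending order
print(eigenvalues, classify(eigenvalues))

# Crude independence assumption: P(all n eigenvalues positive) = 0.5 ** n
for n in (2, 10, 100):
    print(n, 0.5 ** n)
```

The mixed signs (one positive, one negative) are what make the origin a saddle: the surface rises along x and falls along y.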
from mpl_toolkits.mplot3d import Axes3D  # registers the 3D projection on older matplotlib
# Saddle point
x = y = np.linspace(-10, 10, 1000)
x, y = np.meshgrid(x, y)
z = x ** 2 - y ** 2
fig = plt.figure()
# fig.gca(projection='3d') is no longer accepted by recent matplotlib releases
ax = fig.add_subplot(projection='3d')
ax.plot_surface(x, y, z, cmap=plt.cm.coolwarm)
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.set_zlabel('z')
plt.show()
plt.contour(x, y, z)
plt.xlabel('x')
plt.ylabel('y')
plt.show()
How gd behaves at a saddle point
# gradient descent
alpha = 0.1
# Gradient of z = x**2 - y**2 is (2x, -2y)
w = np.array([2, 0, 0, -2]).reshape(2, 2)
t_0 = t = np.array([-9, -0.2]).reshape(2, 1)
z_0 = t_0[0][0] ** 2 - t_0[1][0] ** 2
t_array = []
z_array = []
for i in range(1, 20):
    dt = np.dot(w, t)
    t_step = dt
    t = t - alpha * t_step
    t_array.append(t)
    z_value = t[0][0] ** 2 - t[1][0] ** 2
    z_array.append(z_value)
print("Final value")
print(t[0], t[1])
fig = plt.figure()
# fig.gca(projection='3d') is no longer accepted by recent matplotlib releases
ax = fig.add_subplot(projection='3d')
ax.plot_surface(x, y, z, cmap=plt.cm.coolwarm)
ax.scatter(t_0[0][0], t_0[1][0], z_0)
for i in range(len(t_array)):
    ax.scatter(t_array[i][0], t_array[i][1], z_array[i])
plt.show()
plt.contour(x, y, z)
plt.xlabel("x")
plt.ylabel("y")
plt.scatter(t_0[0], t_0[1])
for i in t_array:
    plt.scatter(i[0], i[1])
plt.show()
Final value
[-0.12970367] [-6.38959999]
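gd escapes here only because the start is slightly off the x axis: with alpha = 0.1 each step multiplies x by (1 - 2 * alpha) = 0.8 and y by (1 + 2 * alpha) = 1.2. The sketch below repeats the run in scalar form and adds a second, hypothetical start at y = 0 (not part of the original run) for comparison.

```python
alpha = 0.1
results = {}
for y0 in (0.0, -0.2):
    x, y = -9.0, y0
    for _ in range(19):              # same 19 updates as range(1, 20)
        dx, dy = 2 * x, -2 * y       # gradient of z = x**2 - y**2
        x, y = x - alpha * dx, y - alpha * dy
    results[y0] = (x, y)
    print(y0, x, y)
```

Starting exactly on the x axis, the y gradient is zero at every step, so gd slides into the saddle at the origin and stays there. With y0 = -0.2 the small offset grows by a factor of 1.2 per step, which is slow at first because the gradient near the axis is tiny.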
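How adam behaves at a saddle point
gd cannot use a large alpha, otherwise it fails to converge. With adam a larger alpha can be used, which speeds up the iteration. adam also balances the step sizes in the x and y directions, so the algorithm does not have to spend too long moving along x.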
# adam
alpha = 2
epsilon = 0.00000001   # avoids division by zero
beta1 = 0.9
beta2 = 0.999
w = np.array([2, 0, 0, -2]).reshape(2, 2)
t_0 = t = np.array([-9, -0.2]).reshape(2, 1)
# Height of the starting point, defined here so this block does not depend on the gd run
z_0 = t_0[0][0] ** 2 - t_0[1][0] ** 2
vdt = 0
sdt = 0
t_array = []
z_array = []
for i in range(1, 10):
    dt = np.dot(w, t)
    vdt = beta1 * vdt + (1 - beta1) * dt
    vdtc = vdt / (1 - beta1 ** i)
    sdt = beta2 * sdt + (1 - beta2) * dt ** 2
    sdtc = sdt / (1 - beta2 ** i)
    t_step = vdtc / (np.sqrt(sdtc) + epsilon)
    t = t - alpha * t_step
    t_array.append(t)
    z_value = t[0][0] ** 2 - t[1][0] ** 2
    z_array.append(z_value)
print("Final value")
print(t[0], t[1])
fig = plt.figure()
# fig.gca(projection='3d') is no longer accepted by recent matplotlib releases
ax = fig.add_subplot(projection='3d')
ax.plot_surface(x, y, z, cmap=plt.cm.coolwarm)
ax.scatter(t_0[0][0], t_0[1][0], z_0)
for i in range(len(t_array)):
    ax.scatter(t_array[i][0], t_array[i][1], z_array[i])
plt.show()
plt.contour(x, y, z)
plt.xlabel("x")
plt.ylabel("y")
plt.scatter(t_0[0], t_0[1])
for i in t_array:
    plt.scatter(i[0], i[1])
plt.show()
Final value
[3.81504286] [-16.860752]