Principle
- Build a regression formula for the classification boundary and find the best-fit parameters; classification is then done with those parameters.
Pros:
- Computationally cheap; easy to understand and implement.
Cons:
- Prone to underfitting, so classification accuracy may be low. When the data are not perfectly linearly separable, the iteration never converges.
Applicable data types:
- nominal and numeric data
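As a concrete illustration of fitting a classification boundary by gradient ascent, here is a minimal self-contained sketch on toy data (the four samples, labels, and step size below are made up for the example, not taken from the book's dataset):

```python
# Minimal logistic-regression sketch: fit weights by gradient ascent
# on a tiny linearly separable dataset.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Bias column of 1.0 plus two features; labels are 0/1.
X = np.array([[1.0,  0.5,  1.2],
              [1.0,  0.9,  0.8],
              [1.0, -0.7, -1.1],
              [1.0, -1.2, -0.4]])
y = np.array([[1.0], [1.0], [0.0], [0.0]])

w = np.ones((3, 1))
alpha = 0.1
for _ in range(500):
    h = sigmoid(X @ w)           # predicted probabilities, shape (4, 1)
    w += alpha * X.T @ (y - h)   # gradient-ascent step on the log-likelihood

preds = (sigmoid(X @ w) > 0.5).astype(int)
print(preds.ravel())  # matches y: [1 1 0 0]
```

The update `w += alpha * X.T @ (y - h)` is exactly the batch rule used later in `gradAscent`, just written with NumPy arrays instead of matrices.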
#Load the data set
from numpy import mat
def loadDataSet():
    dataMat = []
    labelMat = []
    fr = open('../../Reference Code/Ch05/testSet.txt')
    #type(fr.readlines()): list
    #type(fr.read()): str
    for line in fr.readlines():
        #line = '-0.017612\t14.053064\t0\n'
        #lineList = ['-0.017612', '14.053064', '0']
        lineList = line.strip().split() #split the line into a list of fields
        dataMat.append([1.0,float(lineList[0]),float(lineList[1])])
        labelMat.append(int(lineList[2]))
    return mat(dataMat),mat(labelMat)
dataMat,labelMat = loadDataSet()
from numpy import *
#Define the sigmoid function
def sigmoid(inX):
    return 1.0/(1+exp(-inX))
#Gradient ascent
def gradAscent(dataMat,labelMat):
    labelMat = labelMat.T #transpose to an m x 1 column
    m,n = shape(dataMat) #(m,n)
    weight = ones((n,1)) #(n,1)
    maxCycle = 500 #maximum number of iterations
    a = 0.001 #step size
    for k in range(maxCycle):
        h = sigmoid(dataMat*weight) #(m,1)
        err = labelMat - h #(m,1)
        weight = weight + a*dataMat.T*err #(n,1)
    return weight
weight = gradAscent(dataMat,labelMat)
weight
matrix([[ 4.12414349],
[ 0.48007329],
[-0.6168482 ]])
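One caveat about the `sigmoid` above: it computes `exp(-inX)`, which overflows for large negative inputs and triggers runtime warnings. A numerically stable variant (an optional refinement, not part of the book's code) splits the formula by sign:

```python
import numpy as np

def stable_sigmoid(x):
    # Piecewise form that never calls exp() on a large positive argument:
    # for x >= 0 use 1/(1+exp(-x)); for x < 0 use exp(x)/(1+exp(x)).
    x = np.asarray(x, dtype=float)
    out = np.empty_like(x)
    pos = x >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-x[pos]))
    ex = np.exp(x[~pos])
    out[~pos] = ex / (1.0 + ex)
    return out

print(stable_sigmoid(np.array([-1000.0, 0.0, 1000.0])))  # [0.  0.5 1. ]
```

For the inputs in this chapter the naive version works fine; the stable form only matters once `inX*weight` grows large in magnitude.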
#Plot the decision boundary
import matplotlib.pyplot as plt
def plotBestFit(weight):
    #load the data
    dataMat,labelMat = loadDataSet()
    labelList = labelMat.tolist()[0] #flatten the 1 x m label matrix into a plain list
    #separate the samples into class y=1 and class y=0
    x1_1 = [];x2_1 = []
    x1_0 = [];x2_0 = []
    m = shape(dataMat)[0] #number of samples
    for i in range(m):
        if labelList[i] == 1:
            x1_1.append(dataMat[i,1]);x2_1.append(dataMat[i,2])
        else:
            x1_0.append(dataMat[i,1]);x2_0.append(dataMat[i,2])
    #fitted line: w0+w1*x1+w2*x2=0
    weight = array(weight) #convert matrix to array
    x1 = arange(-4,4,0.001) #array
    x2 = (-weight[0]-weight[1]*x1)/weight[2] #array
    #plot
    plt.figure()
    plt.scatter(x1_1,x2_1,c='r',label='Class 1') #class-1 samples
    plt.scatter(x1_0,x2_0,c='g',label='Class 0') #class-0 samples
    plt.plot(x1,x2) #plot() cannot draw x1 or x2 if they are still in matrix form
    plt.legend()
    plt.xlabel('X1')
    plt.ylabel('X2')
    plt.show()
plotBestFit(weight)
[figure output_2_0.png: decision boundary fitted by gradient ascent]
Gradient ascent VS stochastic gradient ascent
- Gradient ascent: every weight update recomputes over all samples, so the cost per step is high; this is batch processing.
- Code:
    for k in range(maxCycle):
        h = sigmoid(dataMat*weight) #(m,1)
        err = labelMat - h #(m,1)
        weight = weight + a*dataMat.transpose()*err
- Stochastic gradient ascent: the weights are updated once per incoming sample; this is online learning.
    for i in range(m):
        h = sigmoid(dataMat[i]*weight) #(1,1)
        err = labelMat[i] - h #(1,1)
        weight = weight + a*dataMat[i].transpose()*err #(n,1)
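The two updates compute the same gradient, just scheduled differently: for a fixed weight vector, the batch step `dataMat.T*err` equals the sum of the per-sample steps. A small check with made-up data (stochastic ascent still ends up at different weights in practice, because the weights move between samples):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))                        # 5 toy samples, 3 features
y = rng.integers(0, 2, size=(5, 1)).astype(float)  # random 0/1 labels
w = np.ones((3, 1))

# Batch gradient: one vectorized pass over all samples.
batch_grad = X.T @ (y - sigmoid(X @ w))

# The same quantity accumulated sample by sample at the SAME w.
acc = np.zeros((3, 1))
for i in range(5):
    xi = X[i:i+1]                                  # keep the 2-D shape (1, 3)
    acc += xi.T @ (y[i:i+1] - sigmoid(xi @ w))

print(np.allclose(batch_grad, acc))  # True
```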
#Stochastic gradient ascent
from numpy import *
def stocGradAscent(dataMat,labelMat):
    labelMat = labelMat.transpose() #transpose to an m x 1 column
    m,n = shape(dataMat)
    a = 0.01 #step size
    weight = ones((n,1)) #(n,1)
    for i in range(m):
        h = sigmoid(dataMat[i]*weight) #(1,1)
        err = labelMat[i] - h #(1,1)
        weight = weight + a*dataMat[i].transpose()*err #(n,1)
    return weight
weight = stocGradAscent(dataMat,labelMat)
weight
matrix([[ 1.01702007],
[ 0.85914348],
[-0.36579921]])
#Decision boundary from stochastic gradient ascent
plotBestFit(weight)
[figure output_5_0.png: decision boundary from stochastic gradient ascent]
#Improved stochastic gradient ascent
import random
def uptatestocGradAscent(dataMat,labelMat,numIter = 150):
    labelMat = labelMat.T
    m,n = shape(dataMat)
    weight = ones((n,1))
    randomIndex = random.sample(range(m),m) #a random permutation of the sample indices
    for numiter in range(numIter):
        for i in range(m):
            a = 4/(1.0+numiter+i)+0.01 #dynamic alpha that shrinks as iterations proceed, damping the oscillation of weight
            index = randomIndex[i] #visiting samples in random order reduces periodic fluctuations of the weights
            h = sigmoid(dataMat[index]*weight)
            err = labelMat[index] - h
            weight = weight + a*dataMat[index].T*err
    return weight
weight = uptatestocGradAscent(dataMat,labelMat,numIter = 20)
weight
matrix([[11.53934818],
[ 1.37987182],
[-1.50178942]])
#Decision boundary from the improved stochastic gradient ascent
plotBestFit(weight)
[figure output_7_0.png: decision boundary from the improved stochastic gradient ascent]
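To see how the dynamic step size behaves, the schedule `a = 4/(1.0+numiter+i)+0.01` from the code above can simply be tabulated for the first few `(numiter, i)` pairs; it decays quickly but is floored at 0.01, so later samples always retain some influence:

```python
# Tabulate the decaying step size used by the improved algorithm
# for numiter in 0..2 and i in 0..2 (9 values).
alphas = [4 / (1.0 + numiter + i) + 0.01
          for numiter in range(3) for i in range(3)]
print([round(a, 3) for a in alphas])
# [4.01, 2.01, 1.343, 2.01, 1.343, 1.01, 1.343, 1.01, 0.81]
```

Because the `+ 0.01` constant keeps alpha strictly positive, the update never freezes, unlike a pure 1/t schedule.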
Stochastic gradient ascent VS improved stochastic gradient ascent
#Stochastic gradient ascent, tracking the weights
from numpy import *
import matplotlib.pyplot as plt
def stocGradAscentPlot(dataMat,labelMat):
    labelMat = labelMat.transpose() #transpose to an m x 1 column
    m,n = shape(dataMat)
    a = 0.01
    weight = ones((n,1)) #(n,1)
    x0 = [];x1 = [];x2 = []
    numIter = 500
    for k in range(numIter): #numIter full passes over the data
        for i in range(m):
            h = sigmoid(dataMat[i]*weight) #(1,1)
            err = labelMat[i] - h #(1,1)
            weight = weight + a*dataMat[i].transpose()*err #(n,1)
        x0.append(float(weight[0]))
        x1.append(float(weight[1]))
        x2.append(float(weight[2]))
    #plot the weight trajectories
    plt.figure(figsize=(8,10))
    plt.subplot(311)
    plt.plot(range(numIter),x0)
    plt.ylabel('X0')
    plt.subplot(312)
    plt.plot(range(numIter),x1)
    plt.ylabel('X1')
    plt.subplot(313)
    plt.plot(range(numIter),x2)
    plt.ylabel('X2')
    plt.tight_layout()
    plt.show()
stocGradAscentPlot(dataMat,labelMat)
[figure output_9_0.png: weight trajectories under stochastic gradient ascent]
#Improved stochastic gradient ascent, tracking the weights
import random
def uptatestocGradAscentPlot(dataMat,labelMat,numIter = 150):
    labelMat = labelMat.T
    m,n = shape(dataMat)
    weight = ones((n,1))
    randomIndex = random.sample(range(m),m) #a random permutation of the sample indices
    x0 = [];x1 = [];x2 = []
    for numiter in range(numIter):
        for i in range(m):
            a = 4/(1.0+numiter+i)+0.01 #dynamic alpha that shrinks as iterations proceed, damping the oscillation of weight
            index = randomIndex[i] #visiting samples in random order reduces periodic fluctuations of the weights
            h = sigmoid(dataMat[index]*weight)
            err = labelMat[index] - h
            weight = weight + a*dataMat[index].T*err
        x0.append(float(weight[0]))
        x1.append(float(weight[1]))
        x2.append(float(weight[2]))
    #plot the weight trajectories
    plt.figure(figsize=(8,10))
    plt.subplot(311)
    plt.plot(range(numIter),x0)
    plt.ylabel('X0')
    plt.subplot(312)
    plt.plot(range(numIter),x1)
    plt.ylabel('X1')
    plt.subplot(313)
    plt.plot(range(numIter),x2)
    plt.ylabel('X2')
    plt.tight_layout()
    plt.show()
uptatestocGradAscentPlot(dataMat,labelMat)
[figure output_10_0.png: weight trajectories under the improved stochastic gradient ascent]
Predicting the mortality of horses with colic from their symptoms
#Classification function
def classify(inX,weight): #both inX and weight are in matrix form
    res = sigmoid(float(inX*weight))
    # inX*weight is a 1 x 1 matrix, so sigmoid of it would also be a matrix;
    # float(inX*weight) is a plain number, so res is a plain number
    if res > 0.5:
        return 1
    else:
        return 0
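A quick sanity check of the decision rule, using a hand-picked weight vector `w = [0, 1, -1]` (hypothetical values, chosen so the boundary is the line x1 = x2):

```python
# Self-contained check: a sample with x1 > x2 should be class 1,
# a sample with x1 < x2 should be class 0.
from numpy import mat, exp

def sigmoid(inX):
    return 1.0 / (1 + exp(-inX))

def classify(inX, weight):
    res = sigmoid(float(inX * weight))
    return 1 if res > 0.5 else 0

w = mat([[0.0], [1.0], [-1.0]])           # bias 0, so boundary is x1 = x2
print(classify(mat([[1.0, 3.0, 1.0]]), w))  # 1: sigmoid(2) > 0.5
print(classify(mat([[1.0, 1.0, 3.0]]), w))  # 0: sigmoid(-2) < 0.5
```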
#Read the data
from numpy import mat
def createData():
    #training data
    fr = open('../../Reference Code/Ch05/horseColicTraining.txt')
    xTrain = [];yTrain = []
    for line in fr.readlines():
        currentLine = line.strip().split()
        lineList = []
        lineList.append(1.0)
        for i in range(len(currentLine)-1):
            lineList.append(float(currentLine[i]))
        xTrain.append(lineList) #build the training sample set
        yTrain.append(float(currentLine[-1])) #build the training label set
    #test data
    fr = open('../../Reference Code/Ch05/horseColicTest.txt')
    xTest = [];yTest = []
    for line in fr.readlines():
        currentLine = line.strip().split()
        lineList = []
        lineList.append(1.0)
        for i in range(len(currentLine)-1):
            lineList.append(float(currentLine[i]))
        xTest.append(lineList) #build the test sample set
        yTest.append(float(currentLine[-1])) #build the test label set
    return mat(xTrain),mat(yTrain),mat(xTest),mat(yTest)
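The parsing loop above can be exercised without the horse-colic files by feeding it an in-memory string; `io.StringIO` behaves like an open file handle (the two sample lines below are made up, not real records):

```python
# Exercise the line-parsing logic on a fake two-line, three-feature file:
# each row becomes [1.0, features...] and the last column becomes the label.
import io
from numpy import mat

sample = "2.0\t1.0\t38.5\t1\n1.0\t1.0\t39.2\t0\n"
fr = io.StringIO(sample)
x = []; y = []
for line in fr.readlines():
    currentLine = line.strip().split()
    lineList = [1.0] + [float(v) for v in currentLine[:-1]]
    x.append(lineList)
    y.append(float(currentLine[-1]))
print(mat(x).shape, y)  # (2, 4) [1.0, 0.0]
```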
#Train on the training data to fit the best weight, then evaluate on the test set
def colicTest():
    #create the data
    xTrainMat,yTrainMat,xTestMat,yTestMat = createData()
    #fit the best weight
    weight = uptatestocGradAscent(xTrainMat,yTrainMat,150)
    #evaluate
    errCount = 0
    yTestMat = yTestMat.T #after transposing, yTestMat[i] picks out each label
    for i in range(len(xTestMat)):
        res = classify(xTestMat[i],weight)
        if res != int(yTestMat[i]):
            errCount += 1
    errRate = errCount/float(len(yTestMat)) #float() avoids integer division under Python 2
    print('The error rate of this test is: %f' %errRate)
    return errRate
xTrainMat,yTrainMat,xTestMat,yTestMat = createData()
xTrainMat
matrix([[ 1. , 2. , 1. , ..., 8.4, 0. , 0. ],
[ 1. , 1. , 1. , ..., 85. , 2. , 2. ],
[ 1. , 2. , 1. , ..., 6.7, 0. , 0. ],
...,
[ 1. , 1. , 1. , ..., 6.8, 0. , 0. ],
[ 1. , 1. , 1. , ..., 6. , 3. , 3.4],
[ 1. , 1. , 1. , ..., 62. , 1. , 1. ]])
errRate = colicTest()
errRate
The error rate of this test is: 0.402985
0.40298507462686567
def multiTest(numTests = 10):
    errRate = 0.0
    for k in range(numTests):
        errRate += colicTest()
    print('after %d iterations the average error rate is:%f' %(numTests,errRate/numTests))
multiTest(numTests = 10)
The error rate of this test is: 0.238806
The error rate of this test is: 0.402985
The error rate of this test is: 0.208955
The error rate of this test is: 0.238806
The error rate of this test is: 0.567164
The error rate of this test is: 0.611940
The error rate of this test is: 0.298507
The error rate of this test is: 0.373134
The error rate of this test is: 0.298507
The error rate of this test is: 0.343284
after 10 iterations the average error rate is:0.358209
multiTest(numTests = 10)
The error rate of this test is: 0.432836
The error rate of this test is: 0.373134
The error rate of this test is: 0.298507
The error rate of this test is: 0.328358
The error rate of this test is: 0.701493
The error rate of this test is: 0.432836
The error rate of this test is: 0.343284
The error rate of this test is: 0.328358
The error rate of this test is: 0.268657
The error rate of this test is: 0.552239
after 10 iterations the average error rate is:0.405970