Online Course: Study Notes on Support Vector Machine Examples
Introduction to Support Vector Machines
A support vector machine (SVM) is a supervised learning model. A data item with n variables can be abstracted as a point in n-dimensional space, with the value of each variable as the coordinate in the corresponding dimension. If a set of data items falls into m classes, m-1 hyperplanes can be constructed in that n-dimensional space to separate the points of the different classes as cleanly as possible; these are the support-vector separating hyperplanes, and the resulting classifier is the SVM classification model.
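As a minimal sketch of this idea (the toy data below is hypothetical), a linear-kernel SVC on 2-D points exposes the learned hyperplane w·x + b = 0 through its coef_ and intercept_ attributes, and the boundary points through support_vectors_:

from sklearn.svm import SVC
import numpy as np
# two clearly separated toy clusters in 2-D space (made-up data)
X = np.array([[0, 0], [1, 1], [1, 0], [4, 4], [5, 5], [4, 5]])
y = np.array([0, 0, 0, 1, 1, 1])
clf = SVC(kernel='linear')
clf.fit(X, y)
print(clf.coef_, clf.intercept_)   # w and b of the separating hyperplane
print(clf.support_vectors_)        # the points that "support" the plane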
Classification Analysis: the Iris Dataset
Scikit-Learn ships with the iris dataset, which can be loaded with datasets.load_iris().
- data: each row holds one iris sample's sepal length, sepal width, petal length, and petal width.
- target: the n-th entry gives the class index (3 classes in total) of the n-th row of data (see the inspection snippet below).
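For reference, a quick way to inspect the dataset's layout (shapes and label names):

from sklearn import datasets
iris = datasets.load_iris()
print(iris.data.shape)      # (150, 4): 150 samples x 4 features
print(iris.target.shape)    # (150,): one class index per sample
print(iris.target_names)    # the 3 iris species
print(iris.feature_names)   # sepal/petal length and width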
First, we evaluate with a randomly held-out test set. Because train_test_split picks the test set at random, the result may differ from run to run. Below is the source code that trains an SVM classifier on the iris dataset and scores it on the held-out portion.
from sklearn import datasets
# cross_validation was removed from scikit-learn; model_selection replaces it
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
# load the bundled dataset
iris_dataset = datasets.load_iris()
iris_data = iris_dataset.data
iris_target = iris_dataset.target
# split data and target into training set and testing set
# 80% training, 20% testing
x_train, x_test, y_train, y_test = train_test_split(iris_data, iris_target, test_size=0.2)
# construct an SVC with the RBF kernel
SVC_0 = SVC(kernel='rbf')
SVC_0.fit(x_train, y_train)
predict = SVC_0.predict(x_test)
# count the correctly classified test samples
right = (predict == y_test).sum()
# accuracy rate
print("%f%%" % (right * 100.0 / predict.shape[0]))
The following source code analyzes the iris dataset with leave-one-out (LOO) validation instead.
from sklearn import datasets
from sklearn.svm import SVC
import numpy as np
def data_svc_test(data, target, index):
    # hold sample `index` out as the test set and train on all the others
    x_train = np.vstack((data[0:index], data[index + 1:]))
    x_test = data[index].reshape(1, -1)  # predict expects a 2-D array
    y_train = np.hstack((target[0:index], target[index + 1:]))
    y_test = target[index]
    SVC_0 = SVC(kernel='rbf')
    SVC_0.fit(x_train, y_train)
    predict = SVC_0.predict(x_test)
    return int(predict[0] == y_test)
# load the bundled dataset
iris_dataset = datasets.load_iris()
iris_data = iris_dataset.data
iris_target = iris_dataset.target
length = iris_target.shape[0]
right = 0
for i in range(0, length):
    right += data_svc_test(iris_data, iris_target, i)
# accuracy rate
print("%f%%" % (right * 100.0 / length))
Regression Analysis: the Boston Housing Dataset
Scikit-learn also ships with the Boston housing dataset, which originates from a 1978 paper in a U.S. economics journal and can be loaded with datasets.load_boston(). The dataset contains the prices of a number of Boston houses along with related statistics; each item holds 14 values, such as the neighborhood crime rate and whether the house borders the river, with the last value being the average house price.
This brings in a data-preprocessing step: the data must be processed before training. The features that influence house prices span very different ranges and are not on the same order of magnitude; training directly on the raw values would let the numerically large features dominate the result, so the importance of the individual features would not be reflected in a balanced way. We therefore use a mathematical transform based on the mean and variance to rescale all features into a common range, so that each carries a comparable weight (a tiny worked example follows).
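Before the full script, here is a small worked example of the standardization used below, f(x) = (x - mean) / std, applied per column (the numbers are made up for illustration):

import numpy as np
from sklearn.preprocessing import StandardScaler
X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0]])
scaler = StandardScaler().fit(X)
print(scaler.mean_)         # per-column mean: [2. 200.]
print(scaler.scale_)        # per-column standard deviation
print(scaler.transform(X))  # both columns now have mean 0 and std 1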
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR
# preprocessing function
from sklearn.preprocessing import StandardScaler
import numpy as np
# note: load_boston was removed in scikit-learn 1.2; this example assumes
# an older version where the loader is still available
house_dataset = datasets.load_boston()
house_data = house_dataset.data
house_price = house_dataset.target
x_train, x_test, y_train, y_test = train_test_split(house_data, house_price, test_size=0.2)
# f(x) = (x - mean) / standard deviation
scaler = StandardScaler()
scaler.fit(x_train)
# standardization
x_train = scaler.transform(x_train)
x_test = scaler.transform(x_test)
# construct SVR model
svr = SVR(kernel='rbf')
svr.fit(x_train, y_train)
y_predict = svr.predict(x_test)
# put actual and predicted prices side by side, one row per test sample
result = np.hstack((y_test.reshape(-1, 1), y_predict.reshape(-1, 1)))
print(result)
The predictions print as two columns: the first column is the actual house price and the second the predicted price (output omitted here).
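To quantify the fit rather than eyeballing the two columns, the script above could end with standard regression metrics (this continues from the y_test and y_predict variables of the previous script; mean_squared_error and r2_score come from sklearn.metrics):

from sklearn.metrics import mean_squared_error, r2_score
print(mean_squared_error(y_test, y_predict))  # average squared error
print(r2_score(y_test, y_predict))            # 1.0 would be a perfect fit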