1. Standard workflow
Import packages --> instantiate the model (hyperparameter: k) --> split into training and test sets --> fit (train) the model --> evaluate --> tune parameters
1.1 Required imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all" # show all outputs of a cell, not just the last
1.2 Instantiate the estimator
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=3)
1.3 Load the data (breast cancer dataset as an example)
from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()
X = pd.DataFrame(cancer.data, columns=cancer.feature_names) # model input is 2-D; ndarray or DataFrame both work, a DataFrame is easier to inspect
y = cancer.target
1.4 Split the data, fit, predict, evaluate
# split the data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
# fit
knn.fit(X_train,y_train)
# predict
knn.predict(X_test) # pass the samples to predict here
# evaluate
knn.score(X_test,y_test)
print("knn.score(): \n{:.2f}".format(knn.score(X_test,y_test)))
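The steps in 1.1 through 1.4 can be combined into one runnable sketch (same dataset and random_state=0 as above):

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

cancer = load_breast_cancer()
X = pd.DataFrame(cancer.data, columns=cancer.feature_names)
y = cancer.target

# split, instantiate, fit, evaluate
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
print("test accuracy: {:.2f}".format(knn.score(X_test, y_test)))
```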
2. Tuning K
train_acc = []
test_acc = []
# n_neighbors ranges from 2 to 30
neighbors_settings = range(2, 31)
for k in neighbors_settings:
    clf = KNeighborsClassifier(n_neighbors=k)
    clf.fit(X_train, y_train)
    train_acc.append(clf.score(X_train, y_train))
    test_acc.append(clf.score(X_test, y_test))
plt.plot(neighbors_settings,train_acc,label="training accuracy")
plt.plot(neighbors_settings, test_acc, label="test accuracy")
plt.ylabel("Accuracy")
plt.xlabel("K")
plt.legend()
# note: the random seed used in train_test_split affects these learning curves
np.argmax(test_acc) # index of the maximum; K starts at 2, so index 15 corresponds to K=17
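Rather than converting the index to K by hand, the best K can be read directly off the range. A minimal sketch with a toy accuracy array (in practice the real `test_acc` from the loop above would be used):

```python
import numpy as np

neighbors_settings = range(2, 31)
test_acc = np.zeros(len(neighbors_settings))
test_acc[15] = 0.95  # toy data: pretend the peak is at index 15

# index into the same range that generated the scores
best_k = neighbors_settings[int(np.argmax(test_acc))]
print(best_k)  # 17
```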
3. Cross-validation: addresses the instability of the knn.score estimate, which makes the chosen K unstable
3.1 Workflow
from sklearn.model_selection import cross_val_score
scores = cross_val_score(knn, cancer.data, cancer.target, cv=5) # 5-fold (the default); args: model, X, y, number of folds
print("scores: {}".format(scores))
mean_score = scores.mean()
print("mean_scores: {:.2f}".format(mean_score))
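Under the hood, for a classifier `cross_val_score` with `cv=5` scores the model on each fold of a stratified 5-fold split; a manual sketch of the same idea:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

cancer = load_breast_cancer()
X, y = cancer.data, cancer.target

scores = []
skf = StratifiedKFold(n_splits=5)
for train_idx, test_idx in skf.split(X, y):
    # train on 4 folds, score on the held-out fold
    clf = KNeighborsClassifier(n_neighbors=3)
    clf.fit(X[train_idx], y[train_idx])
    scores.append(clf.score(X[test_idx], y[test_idx]))
print("mean score: {:.2f}".format(np.mean(scores)))
```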
3.2 Using cross-validation in the learning curve
train_acc = []
test_acc = []
cross_acc = []
# n_neighbors ranges from 2 to 30
neighbors_settings = range(2, 31)
for k in neighbors_settings:
    clf = KNeighborsClassifier(n_neighbors=k)
    clf.fit(X_train, y_train)
    train_acc.append(clf.score(X_train, y_train))
    test_acc.append(clf.score(X_test, y_test))
    cross_acc.append(cross_val_score(clf, cancer.data, cancer.target, cv=5).mean())
# cross-validation is best run on the split training set, since it has already been shuffled
plt.plot(neighbors_settings,train_acc,label="training accuracy")
plt.plot(neighbors_settings, test_acc, label="test accuracy")
plt.plot(neighbors_settings, cross_acc, label="cross accuracy")
plt.ylabel("Accuracy")
plt.xlabel("K")
plt.legend()
np.argmax(cross_acc) # index of the maximum; K starts at 2, so index 11 corresponds to K=13
4. Normalization (min-max scaling)
- Formula: (x - min) / (max - min)
- Solves the problem of one feature's scale dominating the result: e.g. when computing a distance from height and net worth, height contributes almost nothing
- The scaled values preserve the original proportions within each feature
- Syntax:
fit(self, X[, y]): learn the scaling rule (per-feature min and max)
transform(self, X): apply the learned rule to the data
fit_transform(self, X[, y]): both steps in one call
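A tiny numeric check of the formula: with min=1 and max=5, the value 3 maps to (3-1)/(5-1) = 0.5.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[1.0], [3.0], [5.0]])  # one feature: min=1, max=5
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)   # fit learns min/max, transform applies (x-min)/(max-min)
print(X_scaled.ravel())              # [0.  0.5 1. ]
```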
4.1 Workflow
# import --> instantiate --> fit (on the split training set) --> transform training and test sets separately
from sklearn.preprocessing import MinMaxScaler
minmax = MinMaxScaler()
# fit first learns the training set's statistics (min, max, etc.), which are then used to scale; the test set is never fit
minmax.fit(X_train) # fit only on the training set; the same scaler also transforms the test set
X_train_minmax = minmax.transform(X_train) # returns an ndarray
X_test_minmax = minmax.transform(X_test)
# or X_train_minmax = minmax.fit_transform(X_train) in one step
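An alternative not used in these notes is sklearn's Pipeline, which enforces the fit-only-on-the-training-set rule automatically; a sketch assuming the same dataset and split as above:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=0)

# the scaler is fit on X_train only; X_test is only ever transformed
pipe = make_pipeline(MinMaxScaler(), KNeighborsClassifier(n_neighbors=3))
pipe.fit(X_train, y_train)
print("test accuracy: {:.2f}".format(pipe.score(X_test, y_test)))
```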
4.2 Training and tuning on the normalized data
# train and evaluate on the normalized data
# cross-validation in the learning curve
train_acc = []
test_acc = []
cross_acc = []
# n_neighbors ranges from 2 to 30
neighbors_settings = range(2, 31)
for k in neighbors_settings:
    clf = KNeighborsClassifier(n_neighbors=k)
    clf.fit(X_train_minmax, y_train)
    train_acc.append(clf.score(X_train_minmax, y_train))
    test_acc.append(clf.score(X_test_minmax, y_test))
    cross_acc.append(cross_val_score(clf, X_train_minmax, y_train, cv=5).mean())
plt.plot(neighbors_settings,train_acc,label="training accuracy")
plt.plot(neighbors_settings, test_acc, label="test accuracy")
plt.plot(neighbors_settings, cross_acc, label="cross accuracy")
plt.ylabel("Accuracy")
plt.xlabel("K")
plt.legend()
Get the best score and its index
max_score = np.max(cross_acc)
max_index = np.argmax(cross_acc) # best K = max_index + 2; refit the model with that K to get the final estimator