Using the handwritten digits dataset bundled with sklearn
from sklearn.datasets import load_digits
digits = load_digits()
digits.images.shape
(1797, 8, 8)
import matplotlib.pyplot as plt
fig, axes = plt.subplots(10, 10, figsize=(8, 8),
                         subplot_kw={'xticks': [], 'yticks': []},
                         gridspec_kw=dict(hspace=0.1, wspace=0.1))
for i, ax in enumerate(axes.flat):
    ax.imshow(digits.images[i], cmap='binary', interpolation='nearest')
    ax.text(0.05, 0.05, str(digits.target[i]),
            transform=ax.transAxes, color='green')
X = digits.data
X.shape
(1797, 64)
y = digits.target
y.shape
(1797,)
(Figure omitted: a 10×10 grid of the first 100 digit images, each labeled with its target value in green.)
The original 1797 images of 8×8 pixels are flattened into one-dimensional arrays of 64 values, giving [n_samples, n_features] = (1797, 64):
1797 samples with 64 features each.
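In fact digits.data is just digits.images flattened row by row; a quick check (a minimal sketch, reusing the digits object loaded above):

import numpy as np

# each 8x8 image flattens to one row of 64 pixel values
flat = digits.images.reshape(len(digits.images), -1)
print(flat.shape)                      # (1797, 64)
print(np.allclose(flat, digits.data))  # True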
Use Isomap, a manifold learning algorithm, to reduce the 64-dimensional data down to two dimensions
from sklearn.manifold import Isomap
iso = Isomap(n_components=2)
iso.fit(digits.data)
data_projected = iso.transform(digits.data)
data_projected.shape
(1797, 2)
plt.scatter(data_projected[:, 0], data_projected[:, 1], c=digits.target,
            edgecolor='none', alpha=0.5,
            cmap=plt.cm.get_cmap('Spectral', 10))
plt.colorbar(label='digit label', ticks=range(10))
plt.clim(-0.5, 9.5)
(Figure: 2-D Isomap projection of the digits data, points colored by digit label.)
The digits are reasonably well separated in the projected parameter space, so even a very simple supervised classification algorithm should be able to do the job.
Split the data into training and test sets, then fit a Gaussian naive Bayes model; the accuracy comes out around 83%.
from sklearn.model_selection import train_test_split
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, random_state=0)
from sklearn.naive_bayes import GaussianNB
model = GaussianNB()
model.fit(Xtrain, ytrain)
y_model = model.predict(Xtest)
from sklearn.metrics import accuracy_score
accuracy_score(ytest, y_model)
Compute the confusion matrix:
import seaborn as sns
from sklearn.metrics import confusion_matrix
mat = confusion_matrix(ytest, y_model)
sns.heatmap(mat, square=True, annot=True, cbar=False)
plt.xlabel('predicted value')
plt.ylabel('true value')
(Figure: confusion matrix heatmap, predicted value vs. true value.)
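Since the rows of the matrix are the true labels, dividing the diagonal by the row sums gives the per-class recall (a small sketch, reusing mat from above):

# fraction of each true digit that was classified correctly
per_class = mat.diagonal() / mat.sum(axis=1)
for digit, acc in enumerate(per_class):
    print(f'digit {digit}: {acc:.2f}')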
Mark the misclassified digits:
fig, axes = plt.subplots(10, 10, figsize=(8, 8),
                         subplot_kw={'xticks': [], 'yticks': []},
                         gridspec_kw=dict(hspace=0.1, wspace=0.1))
test_images = Xtest.reshape(-1, 8, 8)
for i, ax in enumerate(axes.flat):
    # show the test images (not digits.images, which would be training-order samples)
    ax.imshow(test_images[i], cmap='binary', interpolation='nearest')
    ax.text(0.05, 0.05, str(y_model[i]),
            transform=ax.transAxes,
            color='green' if (ytest[i] == y_model[i]) else 'red')
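The same check can be done numerically instead of visually (a minimal sketch with numpy, reusing ytest and y_model):

import numpy as np

# indices of the test samples the model got wrong
wrong = np.where(ytest != y_model)[0]
print(len(wrong), 'of', len(ytest), 'test samples misclassified')
print(list(zip(ytest[wrong], y_model[wrong]))[:10])  # (true, predicted) pairs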
Try a few other algorithms:
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
svm_model = SVC(gamma='auto')
svm_model.fit(Xtrain, ytrain)
y_pred_svm = svm_model.predict(Xtest)
print("SVM Accuracy: ", accuracy_score(ytest, y_pred_svm))
from sklearn.ensemble import RandomForestClassifier
rf_model = RandomForestClassifier(n_estimators=100)
rf_model.fit(Xtrain, ytrain)
y_pred_rf = rf_model.predict(Xtest)
print("Random Forest Accuracy: ", accuracy_score(ytest, y_pred_rf))
from sklearn.neighbors import KNeighborsClassifier
knn_model = KNeighborsClassifier(n_neighbors=3)
knn_model.fit(Xtrain, ytrain)
y_pred_knn = knn_model.predict(Xtest)
print("KNN Accuracy: ", accuracy_score(ytest, y_pred_knn))
import xgboost as xgb
xgb_model = xgb.XGBClassifier()
xgb_model.fit(Xtrain, ytrain)
y_pred_xgb = xgb_model.predict(Xtest)
print("XGBoost Accuracy: ", accuracy_score(ytest, y_pred_xgb))
The models tried above are a support vector machine (SVM), a random forest, K-nearest neighbors (KNN), and XGBoost, a gradient-boosted decision tree (GBDT) library.
SVM Accuracy: 0.4866666666666667
Random Forest Accuracy: 0.9777777777777777
KNN Accuracy: 0.9866666666666667
XGBoost Accuracy: 0.9555555555555556
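The SVM's poor showing is an artifact of gamma='auto' (gamma = 1/n_features) on the unscaled 0–16 pixel values, not a weakness of SVMs in general; using gamma='scale' (the default since scikit-learn 0.22), optionally with standardized features, should bring it up to the level of the other models. A hedged sketch:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# standardize the pixels, then let gamma adapt to the feature variance
svm_scaled = make_pipeline(StandardScaler(), SVC(gamma='scale'))
svm_scaled.fit(Xtrain, ytrain)
print("Scaled SVM Accuracy:", svm_scaled.score(Xtest, ytest))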
Pay particular attention to the recurring four-step pattern: import the model, instantiate the model, fit the data, predict:
graph LR
    A[Import the model] --> B[Instantiate the model] --> C[Fit the data] --> D[Predict]
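Because every scikit-learn estimator exposes this same interface, the whole comparison above collapses into a loop (a sketch reusing Xtrain/Xtest/ytrain/ytest from above; the model list is just illustrative):

from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

models = [GaussianNB(),
          KNeighborsClassifier(n_neighbors=3),
          RandomForestClassifier(n_estimators=100)]
for model in models:            # instantiate -> fit -> predict, identical for each
    model.fit(Xtrain, ytrain)
    y_pred = model.predict(Xtest)
    print(type(model).__name__, accuracy_score(ytest, y_pred))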
References:
[1] VanderPlas, Jake. Python Data Science Handbook [M]. Posts & Telecom Press, 2018.