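This dataset contains user information from a social network, with the columns User ID, Gender, Age, and EstimatedSalary. A car company has launched a new luxury SUV, and we want to find out which users of the social network will buy it. The last column, Purchased, indicates whether the user bought the car. We want to build a model that predicts whether a user will buy the car from two variables, Age and EstimatedSalary. Our feature matrix therefore contains only these two columns, and we study the relationship between Age, EstimatedSalary, and the purchase decision.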
1. Data Preprocessing
- Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
- 導(dǎo)入數(shù)據(jù)
df = pd.read_csv('D:\\data\\Social_Network_Ads.csv')
X = df.iloc[:, 2:4].values   # features: Age, EstimatedSalary
Y = df.iloc[:, -1].values    # target: Purchased
- Split the dataset
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection instead
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.25)
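Because the split above is random, the confusion matrix reported later will vary from run to run. A minimal sketch of how the split could be made repeatable, assuming synthetic stand-in data (`X_demo`, `y_demo`) and an arbitrary seed of `random_state=0`; neither is part of the original post:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 400 rows, 2 features, binary label.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(400, 2))
y_demo = rng.integers(0, 2, size=400)

# random_state=0 is an arbitrary choice; any fixed integer makes the split repeatable.
split_a = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)
split_b = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

X_tr, X_te, y_tr, y_te = split_a
print(X_tr.shape, X_te.shape)  # (300, 2) (100, 2)

# Both calls return exactly the same test rows.
same = np.array_equal(split_a[1], split_b[1])
print(same)  # True
```

With `test_size = 0.25`, 100 of the 400 rows end up in the test set, which matches the total of the confusion matrix shown further down.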
- Standardize the data
from sklearn.preprocessing import StandardScaler
ss = StandardScaler()
ss = ss.fit(X_train)
X_train = ss.transform(X_train)
X_test = ss.transform(X_test)
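The scaler is fitted on the training set only and then reused on the test set, so the test data never influences the scaling parameters. A small sketch of what `transform` computes, using synthetic data (the `*_demo` arrays and their age/salary-like scales are assumptions for illustration only):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with age-like and salary-like columns.
rng = np.random.default_rng(0)
X_train_demo = rng.normal(loc=[40.0, 70000.0], scale=[10.0, 30000.0], size=(300, 2))
X_test_demo = rng.normal(loc=[40.0, 70000.0], scale=[10.0, 30000.0], size=(100, 2))

# Fit on the training data only.
scaler = StandardScaler().fit(X_train_demo)

# The same transformation by hand: subtract the training mean,
# divide by the training standard deviation (population std, ddof=0).
mu = X_train_demo.mean(axis=0)
sigma = X_train_demo.std(axis=0)
manual_test = (X_test_demo - mu) / sigma

matches = np.allclose(scaler.transform(X_test_demo), manual_test)
print(matches)  # True
```

Scaling matters for K-NN in particular: without it, the salary column (tens of thousands) would dominate the distance over the age column (tens).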
2. Train the K-NN Model
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors = 5, metric = 'minkowski', p = 2 )
knn.fit(X_train, Y_train)
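With `metric = 'minkowski'` and `p = 2`, the distance is the ordinary Euclidean distance, and with `n_neighbors = 5` each prediction is a majority vote among the five closest training points. The sketch below reproduces that rule by hand on synthetic data and compares it with scikit-learn; the helper `knn_predict_one` and the demo arrays are hypothetical names introduced for illustration, not part of the original post:

```python
import numpy as np
from collections import Counter
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in: two features, label decided by a simple linear rule.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)
queries = rng.normal(size=(20, 2))

def knn_predict_one(x, X_ref, y_ref, k=5):
    """Hypothetical helper: majority vote among the k nearest points."""
    # Minkowski distance with p = 2 is the Euclidean distance.
    dists = np.sqrt(((X_ref - x) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]
    # k is odd and there are two classes, so the vote cannot tie.
    return Counter(y_ref[nearest].tolist()).most_common(1)[0][0]

manual = np.array([knn_predict_one(q, X_demo, y_demo) for q in queries])

model = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2)
model.fit(X_demo, y_demo)

agree = np.array_equal(manual, model.predict(queries))
print(agree)
```

Choosing an odd `n_neighbors` for a two-class problem avoids tied votes, which is one reason 5 is a common default.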
3. Predict the Test Set Results
Y_pred = knn.predict(X_test)
4. Evaluate the Results
- Confusion matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(Y_test, Y_pred)
array([[60,  6],
       [ 6, 28]], dtype=int64)
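In scikit-learn's `confusion_matrix`, rows are the true classes and columns are the predicted classes, so the matrix above reads as 60 true negatives, 6 false positives, 6 false negatives, and 28 true positives. A short sketch that derives the usual summary metrics from those four numbers (the matrix is copied from the output above; the metric names are standard definitions, not something the original post computed):

```python
import numpy as np

# Confusion matrix copied from the output above.
cm = np.array([[60, 6],
               [6, 28]])

# Row-major order for a binary problem: TN, FP, FN, TP.
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # share of all test users classified correctly
precision = tp / (tp + fp)        # of predicted buyers, how many actually bought
recall = tp / (tp + fn)           # of actual buyers, how many were found

print(f"accuracy={accuracy:.2f} precision={precision:.3f} recall={recall:.3f}")
# accuracy=0.88 precision=0.824 recall=0.824
```

The model classifies 88 of the 100 test users correctly. Precision and recall happen to be equal here because the numbers of false positives and false negatives are both 6.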
- Visualization
Training set visualization
from matplotlib.colors import ListedColormap
X_set, y_set = X_train, Y_train
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, stop = X_set[:, 0].max() + 1, step = 0.01),
                     np.arange(start = X_set[:, 1].min() - 1, stop = X_set[:, 1].max() + 1, step = 0.01))
plt.contourf(X1, X2, knn.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha = 0.5, cmap = ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
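for i, j in enumerate(np.unique(y_set)):
    # np.asarray keeps the boolean mask valid whether y_set is an array or a pandas Series
    plt.scatter(X_set[np.asarray(y_set == j), 0], X_set[np.asarray(y_set == j), 1],
                color = ListedColormap(('red', 'green'))(i), label = j)
plt.title('K-NN (Training set)')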
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
Test set visualization
from matplotlib.colors import ListedColormap
X_set, y_set = X_test, Y_test
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, stop = X_set[:, 0].max() + 1, step = 0.01),
                     np.arange(start = X_set[:, 1].min() - 1, stop = X_set[:, 1].max() + 1, step = 0.01))
plt.contourf(X1, X2, knn.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha = 0.5, cmap = ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
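for i, j in enumerate(np.unique(y_set)):
    # np.asarray keeps the boolean mask valid whether y_set is an array or a pandas Series
    plt.scatter(X_set[np.asarray(y_set == j), 0], X_set[np.asarray(y_set == j), 1],
                color = ListedColormap(('red', 'green'))(i), label = j)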
plt.title('K-NN (Test set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()