Diamond Price Prediction: Categorical Features
No.  Field    Type    Description
1    carat    Float   Carat weight
2    cut      String  Cut quality of the diamond, 5 classes, in increasing order: Fair, Good, Very Good, Premium, Ideal
3    color    String  Color of the diamond, from D (best) to J (worst)
4    clarity  String  Clarity rating: how obvious inclusions are within the diamond, 8 classes in this data; full scale from best to worst (FL = flawless, I3 = level-3 inclusions): FL, IF, VVS1, VVS2, VS1, VS2, SI1, SI2, I1, I2, I3
5    depth    Float   Depth percentage: diamond height divided by average diameter, in %
6    table    Float   Table percentage: table width divided by average diameter, in %
7    price    Int     Price of the diamond, in USD
8    x        Float   Length, in mm
9    y        Float   Width, in mm
10   z        Float   Depth, in mm
# Load packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Load warnings
import warnings
# Ignore warnings
warnings.filterwarnings("ignore")
# Read the data from the CSV file
data = pd.read_csv('diamonds.csv')
print(plt.style.available) # list all available plot styles
plt.style.use('ggplot') # use the 'ggplot' style
# Inspect the feature and target values
data.head()
Exploratory Data Analysis & Preprocessing
- There are no missing values in the data, so no special handling is needed
- Most of the feature dimensions look reasonable
- The variables include int, float, and categorical types; the categorical ones need further processing (see the encoding sketch below)
- The dataset contains 53,940 rows
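The models below feed cut and clarity into scikit-learn as numbers, so the string columns must be ordinal-encoded beforehand. The exact mapping is not shown in this section; the following is a minimal sketch assuming a simple worst-to-best integer coding:
# Minimal encoding sketch (assumed mapping; the notebook's actual step is not shown here)
cut_order = {'Fair': 0, 'Good': 1, 'Very Good': 2, 'Premium': 3, 'Ideal': 4}
color_order = {c: i for i, c in enumerate('JIHGFED')}  # J (worst) ... D (best)
clarity_order = {c: i for i, c in enumerate(['I1','SI2','SI1','VS2','VS1','VVS2','VVS1','IF'])}
data['cut'] = data['cut'].map(cut_order)
data['color'] = data['color'].map(color_order)
data['clarity'] = data['clarity'].map(clarity_order)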
KNN without a train/test split
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors = 3)
# Features (cut and clarity assumed already ordinal-encoded; see the sketch above)
x = data[['carat','cut','clarity','depth','table','x','y','z']]
y = data.loc[:,'price']
knn.fit(x,y)
prediction = knn.predict(x)
# format(): formatted string output
print('Prediction: {}'.format(prediction))
# No hold-out data here, so we can only score on the training set itself
print('With KNN (K=3) accuracy is: ',knn.score(x,y))
Prediction: [ 326 326 327 ... 2039 2732 2489]
KNN with a 30% test set
from sklearn.model_selection import train_test_split
# Define the features and the target (same as above)
x = data[['carat','cut','clarity','depth','table','x','y','z']]
y = data.loc[:,'price']
# Split into training and test sets with a fixed random seed
# (so the split is identical on every run)
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size = 0.3,random_state = 1)
# Set k to 3
knn = KNeighborsClassifier(n_neighbors = 3)
# Fit the model on the training set
knn.fit(x_train,y_train)
# Predict on the test set and report accuracy
prediction = knn.predict(x_test)
print('Prediction: {}'.format(prediction))
print('With KNN (K=3) accuracy is: ',knn.score(x_test,y_test))
Prediction: [ 449 6321 2131 ... 625 730 4168]
With KNN (K=3) accuracy is: 0.013842541095043875
Hyperparameter tuning
# Model complexity
neig = np.arange(1, 25)
train_accuracy = []
test_accuracy = []
# Loop over k from 1 to 24 (np.arange(1, 25) includes 1 but excludes 25)
for i, k in enumerate(neig):
    knn = KNeighborsClassifier(n_neighbors=k)
    # Fit with KNN
    knn.fit(x_train,y_train)
    # Accuracy on the training set
    train_accuracy.append(knn.score(x_train, y_train))
    # Accuracy on the test set
    test_accuracy.append(knn.score(x_test, y_test))
# Visualize
plt.figure(figsize=[13,8])
plt.plot(neig, test_accuracy, label = 'Testing Accuracy')
plt.plot(neig, train_accuracy, label = 'Training Accuracy')
plt.legend()
plt.title('k value VS Accuracy')
plt.xlabel('Number of Neighbors')
plt.ylabel('Accuracy')
plt.xticks(neig)
plt.savefig('graph.png')
plt.show()
print("Best accuracy is {} with K = {}".format(np.max(test_accuracy),1+test_accuracy.index(np.max(test_accuracy))))
As we can see, KNN performs poorly here: the target (price) is a continuous variable, and classification accuracy only rewards exact matches, so KNN classification is better suited to categorical targets.
Let's try some other methods.
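For comparison, scikit-learn's KNeighborsRegressor treats price as continuous (it averages the neighbors' prices and scores with R^2); a sketch, not part of the original run:
from sklearn.neighbors import KNeighborsRegressor
knn_reg = KNeighborsRegressor(n_neighbors=3)
knn_reg.fit(x_train, y_train)
print('KNN regression R^2: ', knn_reg.score(x_test, y_test))  # R^2, not exact-match accuracy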
Linear regression
# Use carat alone as the predictor
x = np.array(data.loc[:,'carat']).reshape(-1,1)
y = np.array(data.loc[:,'price']).reshape(-1,1)
# Linear regression
from sklearn.linear_model import LinearRegression
reg = LinearRegression()
# Prediction range
predict_space = np.linspace(x.min(), x.max()).reshape(-1,1)
# Fit the model to the data
reg.fit(x,y)
# Predict over the range
predicted = reg.predict(predict_space)
# R^2
print('R^2 score: ',reg.score(x, y))
# Plot the regression line over the scatter
plt.plot(predict_space, predicted, color='black', linewidth=3)
plt.scatter(x=x,y=y)
plt.xlabel('carat')
plt.ylabel('price')
plt.show()
R^2 score: 0.8493305264354857
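R^2 alone hides the error scale; RMSE reports it in dollars. A sketch, not in the original:
from sklearn.metrics import mean_squared_error
# Root mean squared error of the carat-only fit, in USD (sketch)
rmse = np.sqrt(mean_squared_error(y, reg.predict(x)))
print('RMSE: {:.0f} USD'.format(rmse))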
# Ridge
from sklearn.linear_model import Ridge
# Fixed random seed; random_state=2 yields a different split than random_state=1
x_train,x_test,y_train,y_test = train_test_split(x,y,random_state = 2, test_size = 0.3)
# Note: normalize= was deprecated in scikit-learn 1.0 and removed in 1.2;
# on newer versions use make_pipeline(StandardScaler(), Ridge(alpha=0.1)) instead
ridge = Ridge(alpha = 0.1, normalize = True)
ridge.fit(x_train,y_train)
ridge_predict = ridge.predict(x_test)
print('Ridge score: ',ridge.score(x_test,y_test))
Ridge score: 0.8415434800632169
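Instead of hand-picking alpha = 0.1, RidgeCV selects it by cross-validation; a sketch over a small assumed grid:
from sklearn.linear_model import RidgeCV
ridge_cv = RidgeCV(alphas=[0.01, 0.1, 1.0, 10.0]).fit(x_train, y_train)  # grid is an assumption
print('Best alpha:', ridge_cv.alpha_, ' test score:', ridge_cv.score(x_test, y_test))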
# Lasso
from sklearn.linear_model import Lasso
x = data[['carat','cut','clarity','depth','table','x','y','z']]
x_train,x_test,y_train,y_test = train_test_split(x,y,random_state = 2, test_size = 0.3)
# normalize= was removed in scikit-learn 1.2 (see the Ridge note above)
lasso = Lasso(alpha = 0.1, normalize = True)
lasso.fit(x_train,y_train)
lasso_predict = lasso.predict(x_test)
print('Lasso score: ',lasso.score(x_test,y_test))
print('Lasso coefficients: ',lasso.coef_)
Lasso score: 0.8854378481948613
Lasso coefficients: [8434.64906964 -123.88370107 -350.45873524 -38.29023158 -14.11709282
-36.3572615 -0. -36.2808792 ]
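Note that Lasso shrank the coefficient of the y (width) column exactly to zero, which is its built-in feature selection at work. Pairing coefficients with column names makes this easier to read (sketch, not in the original):
for name, coef in zip(x.columns, lasso.coef_):
    print('{:>8}: {:12.2f}'.format(name, coef))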
Random Forest
from sklearn.ensemble import RandomForestRegressor
# Fit a random forest regressor with default hyperparameters
rf = RandomForestRegressor()
rf.fit(x_train, y_train)
y_pred = rf.predict(x_test)
print('RandomForest score: ',rf.score(x_test,y_test))
RandomForest score: 0.9365442332739605
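The fitted forest also exposes feature_importances_, which shows how dominant carat is; a sketch, not in the original:
# Sort features by importance, largest first (sketch)
for name, imp in sorted(zip(x.columns, rf.feature_importances_), key=lambda t: -t[1]):
    print('{:>8}: {:.3f}'.format(name, imp))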
Gradient Boosting
from sklearn.ensemble import GradientBoostingRegressor
# Note: loss='ls' was renamed to 'squared_error' in scikit-learn 1.0 and removed in 1.2
gbr = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, max_depth=1,
                                random_state=0, loss='ls', verbose=1).fit(x_train, y_train)
y_pred = gbr.predict(x_test)
print('GradientBoosting score: ',gbr.score(x_test,y_test))
Iter Train Loss Remaining Time
1 14094461.6429 0.48s
2 12496608.8546 0.67s
3 11168569.3479 0.61s
4 9986874.8068 0.58s
5 9008825.0389 0.56s
6 8133660.7414 0.56s
7 7402916.6391 0.55s
8 6762929.6866 0.53s
9 6204219.0082 0.52s
10 5728951.1243 0.51s
20 3198851.7843 0.41s
30 2385567.8147 0.35s
40 2077353.7183 0.34s
50 1886477.5667 0.30s
60 1755080.7615 0.23s
70 1660608.8724 0.17s
80 1592460.5433 0.11s
90 1541833.7987 0.06s
100 1504583.3182 0.00s
GradientBoosting score: 0.9066881031052523
Of the models tried, the random forest performs best.
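Since each score above comes from a single 70/30 split, a fairer comparison would cross-validate every model on identical folds; a sketch for the best model, not part of the original run:
from sklearn.model_selection import cross_val_score
# 5-fold cross-validated R^2 for the random forest (sketch; y flattened to 1-D)
scores = cross_val_score(RandomForestRegressor(), x, np.ravel(y), cv=5)
print('Random forest CV R^2: {:.3f} +/- {:.3f}'.format(scores.mean(), scores.std()))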