Diamond Price Prediction: Categorical Features
No.  Field    Type    Description
1    carat    Float   Carat weight
2    cut      String  Cut quality of the diamond, 5 classes, in increasing order: Fair, Good, Very Good, Premium, Ideal
3    color    String  Color of the diamond, from D (best) to J (worst)
4    clarity  String  Clarity rating: how obvious inclusions are within the diamond, 8 classes in this data; full scale from best to worst (FL = flawless, I3 = level-3 inclusions): FL, IF, VVS1, VVS2, VS1, VS2, SI1, SI2, I1, I2, I3
5    depth    Float   Depth percentage: diamond height divided by average diameter, in %
6    table    Float   Table percentage: table width divided by average diameter, in %
7    price    Int     Price of the diamond, in USD
8    x        Float   Length, in mm
9    y        Float   Width, in mm
10   z        Float   Depth, in mm
# Load packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Load warnings
import warnings
# Ignore warnings
warnings.filterwarnings("ignore")
# Read the data from the CSV file
data = pd.read_csv('diamonds.csv')
print(plt.style.available) # list all available plot styles
plt.style.use('ggplot') # use the 'ggplot' style
# Inspect the feature and target values
data.head()
Exploratory Data Analysis & Preprocessing
- There are no missing values in the data, so no special handling is needed
- Most of the feature dimensions look reasonable
- The variables include int, float, and categorical types; the categorical ones need further processing (see the encoding sketch below)
- The dataset contains 53,940 rows
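The models below feed cut and clarity into scikit-learn as numbers, so the string columns must be ordinal-encoded beforehand. The exact mapping is not shown in this section; the following is a minimal sketch assuming a simple worst-to-best integer coding:
# Minimal encoding sketch (assumed mapping; the notebook's actual step is not shown here)
cut_order = {'Fair': 0, 'Good': 1, 'Very Good': 2, 'Premium': 3, 'Ideal': 4}
color_order = {c: i for i, c in enumerate('JIHGFED')}  # J (worst) ... D (best)
clarity_order = {c: i for i, c in enumerate(['I1','SI2','SI1','VS2','VS1','VVS2','VVS1','IF'])}
data['cut'] = data['cut'].map(cut_order)
data['color'] = data['color'].map(color_order)
data['clarity'] = data['clarity'].map(clarity_order)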
KNN without a train/test split
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors = 3)
# Features (cut and clarity assumed already ordinal-encoded; see the sketch above)
x = data[['carat','cut','clarity','depth','table','x','y','z']]
y = data.loc[:,'price']
knn.fit(x,y)
prediction = knn.predict(x)
# format(): formatted string output
print('Prediction: {}'.format(prediction))
# No hold-out data here, so we can only score on the training set itself
print('With KNN (K=3) accuracy is: ',knn.score(x,y))
Prediction: [ 326 326 327 ... 2039 2732 2489]
KNN with a 30% test set
from sklearn.model_selection import train_test_split
# Define the features and the target (same as above)
x = data[['carat','cut','clarity','depth','table','x','y','z']]
y = data.loc[:,'price']
# Split into training and test sets with a fixed random seed
# (so the split is identical on every run)
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size = 0.3,random_state = 1)
# Set k to 3
knn = KNeighborsClassifier(n_neighbors = 3)
# Fit the model on the training set
knn.fit(x_train,y_train)
# Predict on the test set and report accuracy
prediction = knn.predict(x_test)
print('Prediction: {}'.format(prediction))
print('With KNN (K=3) accuracy is: ',knn.score(x_test,y_test))
Prediction: [ 449 6321 2131 ... 625 730 4168]
With KNN (K=3) accuracy is: 0.013842541095043875
Hyperparameter tuning
# Model complexity
neig = np.arange(1, 25)
train_accuracy = []
test_accuracy = []
# Loop over k from 1 to 24 (np.arange(1, 25) includes 1 but excludes 25)
for i, k in enumerate(neig):
    knn = KNeighborsClassifier(n_neighbors=k)
    # Fit with KNN
    knn.fit(x_train,y_train)
    # Accuracy on the training set
    train_accuracy.append(knn.score(x_train, y_train))
    # Accuracy on the test set
    test_accuracy.append(knn.score(x_test, y_test))
# Visualize
plt.figure(figsize=[13,8])
plt.plot(neig, test_accuracy, label = 'Testing Accuracy')
plt.plot(neig, train_accuracy, label = 'Training Accuracy')
plt.legend()
plt.title('k value VS Accuracy')
plt.xlabel('Number of Neighbors')
plt.ylabel('Accuracy')
plt.xticks(neig)
plt.savefig('graph.png')
plt.show()
print("Best accuracy is {} with K = {}".format(np.max(test_accuracy),1+test_accuracy.index(np.max(test_accuracy))))
As we can see, KNN performs poorly here: the target (price) is a continuous variable, and classification accuracy only rewards exact matches, so KNN classification is better suited to categorical targets.
Let's try some other methods.
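For comparison, scikit-learn's KNeighborsRegressor treats price as continuous (it averages the neighbors' prices and scores with R^2); a sketch, not part of the original run:
from sklearn.neighbors import KNeighborsRegressor
knn_reg = KNeighborsRegressor(n_neighbors=3)
knn_reg.fit(x_train, y_train)
print('KNN regression R^2: ', knn_reg.score(x_test, y_test))  # R^2, not exact-match accuracy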
Linear regression
# Use carat alone as the predictor
x = np.array(data.loc[:,'carat']).reshape(-1,1)
y = np.array(data.loc[:,'price']).reshape(-1,1)
# Linear regression
from sklearn.linear_model import LinearRegression
reg = LinearRegression()
# Prediction range
predict_space = np.linspace(x.min(), x.max()).reshape(-1,1)
# Fit the model to the data
reg.fit(x,y)
# Predict over the range
predicted = reg.predict(predict_space)
# R^2
print('R^2 score: ',reg.score(x, y))
# Plot the regression line over the scatter
plt.plot(predict_space, predicted, color='black', linewidth=3)
plt.scatter(x=x,y=y)
plt.xlabel('carat')
plt.ylabel('price')
plt.show()
R^2 score: 0.8493305264354857
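R^2 alone hides the error scale; RMSE reports it in dollars. A sketch, not in the original:
from sklearn.metrics import mean_squared_error
# Root mean squared error of the carat-only fit, in USD (sketch)
rmse = np.sqrt(mean_squared_error(y, reg.predict(x)))
print('RMSE: {:.0f} USD'.format(rmse))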
# Ridge
from sklearn.linear_model import Ridge
# Fixed random seed; random_state=2 yields a different split than random_state=1
x_train,x_test,y_train,y_test = train_test_split(x,y,random_state = 2, test_size = 0.3)
# Note: normalize= was deprecated in scikit-learn 1.0 and removed in 1.2;
# on newer versions use make_pipeline(StandardScaler(), Ridge(alpha=0.1)) instead
ridge = Ridge(alpha = 0.1, normalize = True)
ridge.fit(x_train,y_train)
ridge_predict = ridge.predict(x_test)
print('Ridge score: ',ridge.score(x_test,y_test))
Ridge score: 0.8415434800632169
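Instead of hand-picking alpha = 0.1, RidgeCV selects it by cross-validation; a sketch over a small assumed grid:
from sklearn.linear_model import RidgeCV
ridge_cv = RidgeCV(alphas=[0.01, 0.1, 1.0, 10.0]).fit(x_train, y_train)  # grid is an assumption
print('Best alpha:', ridge_cv.alpha_, ' test score:', ridge_cv.score(x_test, y_test))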
# Lasso
from sklearn.linear_model import Lasso
x = data[['carat','cut','clarity','depth','table','x','y','z']]
x_train,x_test,y_train,y_test = train_test_split(x,y,random_state = 2, test_size = 0.3)
# normalize= was removed in scikit-learn 1.2 (see the Ridge note above)
lasso = Lasso(alpha = 0.1, normalize = True)
lasso.fit(x_train,y_train)
lasso_predict = lasso.predict(x_test)
print('Lasso score: ',lasso.score(x_test,y_test))
print('Lasso coefficients: ',lasso.coef_)
Lasso score: 0.8854378481948613
Lasso coefficients: [8434.64906964 -123.88370107 -350.45873524 -38.29023158 -14.11709282
-36.3572615 -0. -36.2808792 ]
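Note that Lasso shrank the coefficient of the y (width) column exactly to zero, which is its built-in feature selection at work. Pairing coefficients with column names makes this easier to read (sketch, not in the original):
for name, coef in zip(x.columns, lasso.coef_):
    print('{:>8}: {:12.2f}'.format(name, coef))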
Random Forest
from sklearn.ensemble import RandomForestRegressor
# Fit a random forest regressor with default hyperparameters
rf = RandomForestRegressor()
rf.fit(x_train, y_train)
y_pred = rf.predict(x_test)
print('RandomForest score: ',rf.score(x_test,y_test))
RandomForest score: 0.9365442332739605
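The fitted forest also exposes feature_importances_, which shows how dominant carat is; a sketch, not in the original:
# Sort features by importance, largest first (sketch)
for name, imp in sorted(zip(x.columns, rf.feature_importances_), key=lambda t: -t[1]):
    print('{:>8}: {:.3f}'.format(name, imp))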
Gradient Boosting
from sklearn.ensemble import GradientBoostingRegressor
# Note: loss='ls' was renamed to 'squared_error' in scikit-learn 1.0 and removed in 1.2
gbr = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, max_depth=1,
                                random_state=0, loss='ls', verbose=1).fit(x_train, y_train)
y_pred = gbr.predict(x_test)
print('GradientBoosting score: ',gbr.score(x_test,y_test))
Iter Train Loss Remaining Time
1 14094461.6429 0.48s
2 12496608.8546 0.67s
3 11168569.3479 0.61s
4 9986874.8068 0.58s
5 9008825.0389 0.56s
6 8133660.7414 0.56s
7 7402916.6391 0.55s
8 6762929.6866 0.53s
9 6204219.0082 0.52s
10 5728951.1243 0.51s
20 3198851.7843 0.41s
30 2385567.8147 0.35s
40 2077353.7183 0.34s
50 1886477.5667 0.30s
60 1755080.7615 0.23s
70 1660608.8724 0.17s
80 1592460.5433 0.11s
90 1541833.7987 0.06s
100 1504583.3182 0.00s
GradientBoosting score: 0.9066881031052523
Of the models tried, the random forest performs best.
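Since each score above comes from a single 70/30 split, a fairer comparison would cross-validate every model on identical folds; a sketch for the best model, not part of the original run:
from sklearn.model_selection import cross_val_score
# 5-fold cross-validated R^2 for the random forest (sketch; y flattened to 1-D)
scores = cross_val_score(RandomForestRegressor(), x, np.ravel(y), cv=5)
print('Random forest CV R^2: {:.3f} +/- {:.3f}'.format(scores.mean(), scores.std()))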