- 1.1. Introduction to the linear regression algorithm
- 1.2. Deriving simple linear regression with least squares
- 1.3. Metrics for evaluating linear regression
- 1.4. The best metric for linear regression: R Squared
- 1.5. Multiple linear regression
1.1. Introduction to the Linear Regression Algorithm
Linear regression places samples in a coordinate system in which one dimension is the output and the other dimensions are the features (for example, in a two-dimensional plane the horizontal axis is the feature and the vertical axis is the output). When a large number of training samples are plotted this way, they turn out to be distributed around a straight line. What linear regression looks for is the straight line that best "fits" the relationship between the sample features and the sample output labels.
#### A linear regression problem with only one sample feature is called simple linear regression, e.g., house price vs. house area
With the feature on the x-axis and the output on the y-axis, every sample is a point (x(i), y(i)). The line we are looking for is y = ax + b, and for a new point x(j) the prediction is ŷ(j) = a·x(j) + b. To fit the line we need a way to measure the gap between the true value y(i) and the predicted value ŷ(i):
- We do not use the plain difference y(i) − ŷ(i), because positive and negative differences cancel each other out
- We do not use the absolute value |y(i) − ŷ(i)|, because the absolute-value function is not differentiable everywhere
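So we use the squared difference instead: simple linear regression looks for the a and b that minimize the sum of squared errors
$$
J(a, b) = \sum_{i=1}^{m} \left( y^{(i)} - a x^{(i)} - b \right)^2
$$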
#### From the reasoning above we can summarize the basic workflow of a whole class of machine learning algorithms: define a loss function that measures the gap between the true values and the predicted values and make that loss as small as possible, or define a utility function that measures the goodness of fit and make that utility as large as possible
1.2. Deriving Simple Linear Regression with Least Squares
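Setting the partial derivatives of J(a, b) with respect to a and b to zero (the least-squares derivation) gives the closed-form solution that the code below implements:
$$
a = \frac{\sum_{i=1}^{m} \left(x^{(i)} - \bar{x}\right)\left(y^{(i)} - \bar{y}\right)}{\sum_{i=1}^{m} \left(x^{(i)} - \bar{x}\right)^{2}}, \qquad b = \bar{y} - a\,\bar{x}
$$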
實(shí)現(xiàn)簡(jiǎn)單線性回歸法
import numpy as np
import matplotlib.pyplot as plt
x = np.array([1., 2., 3., 4., 5.])
y = np.array([1., 3., 2., 3., 5.])
plt.scatter(x, y)
plt.axis([0, 6, 0, 6])
plt.show()
x_mean = np.mean(x)
y_mean = np.mean(y)
num = 0.0
d = 0.0
for x_i, y_i in zip(x, y):
    num += (x_i - x_mean) * (y_i - y_mean)
    d += (x_i - x_mean) ** 2
a = num/d
b = y_mean - a * x_mean
y_hat = a * x + b
plt.scatter(x, y)
plt.plot(x, y_hat, color='r')
plt.axis([0, 6, 0, 6])
plt.show()
x_predict = 6
y_predict = a * x_predict + b
y_predict
5.2000000000000002
Encapsulating our own SimpleLinearRegression
Code: SimpleLinearRegression.py
import numpy as np

class SimpleLinearRegression1:

    def __init__(self):
        """Initialize the Simple Linear Regression model"""
        self.a_ = None
        self.b_ = None

    def fit(self, x_train, y_train):
        """Fit the Simple Linear Regression model on the training set x_train, y_train"""
        assert x_train.ndim == 1, \
            "Simple Linear Regression can only solve single feature training data"
        assert len(x_train) == len(y_train), \
            "the size of x_train must be equal to the size of y_train"
        ## means
        x_mean = x_train.mean()
        y_mean = y_train.mean()
        ## numerator
        num = 0.0
        ## denominator
        d = 0.0
        ## accumulate numerator and denominator
        for x_i, y_i in zip(x_train, y_train):
            num += (x_i - x_mean) * (y_i - y_mean)
            d += (x_i - x_mean) ** 2
        ## compute the parameters a and b
        self.a_ = num / d
        self.b_ = y_mean - self.a_ * x_mean
        return self

    def predict(self, x_predict):
        """Given a set of inputs x_predict, return the corresponding predicted values"""
        assert x_predict.ndim == 1, \
            "Simple Linear Regression can only solve single feature training data"
        assert self.a_ is not None and self.b_ is not None, \
            "must fit before predict!"
        return np.array([self._predict(x) for x in x_predict])

    def _predict(self, x_single):
        """Given a single input x_single, return its predicted value"""
        return self.a_ * x_single + self.b_

    def __repr__(self):
        return "SimpleLinearRegression1()"
from playML.SimpleLinearRegression import SimpleLinearRegression1
reg1 = SimpleLinearRegression1()
reg1.fit(x, y)
reg1.predict(np.array([x_predict]))
array([ 5.2])
reg1.a_
0.80000000000000004
reg1.b_
0.39999999999999947
y_hat1 = reg1.predict(x)
plt.scatter(x, y)
plt.plot(x, y_hat1, color='r')
plt.axis([0, 6, 0, 6])
plt.show()
Vectorization
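Both the numerator and the denominator of a are sums of element-wise products, so they can be rewritten as dot products of the de-meaned vectors, which NumPy evaluates far faster than a Python loop:
$$
a = \frac{(x - \bar{x}) \cdot (y - \bar{y})}{(x - \bar{x}) \cdot (x - \bar{x})}
$$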
Vectorized implementation of SimpleLinearRegression
Code: SimpleLinearRegression.py
import numpy as np

class SimpleLinearRegression2:

    def __init__(self):
        """Initialize the Simple Linear Regression model"""
        self.a_ = None
        self.b_ = None

    def fit(self, x_train, y_train):
        """Fit the Simple Linear Regression model on the training set x_train, y_train"""
        assert x_train.ndim == 1, \
            "Simple Linear Regressor can only solve single feature training data."
        assert len(x_train) == len(y_train), \
            "the size of x_train must be equal to the size of y_train"
        x_mean = np.mean(x_train)
        y_mean = np.mean(y_train)
        ## vectorized: numerator and denominator as dot products of the de-meaned vectors
        self.a_ = (x_train - x_mean).dot(y_train - y_mean) / (x_train - x_mean).dot(x_train - x_mean)
        self.b_ = y_mean - self.a_ * x_mean
        return self

    def predict(self, x_predict):
        """Given a set of inputs x_predict, return the vector of predicted values"""
        assert x_predict.ndim == 1, \
            "Simple Linear Regressor can only solve single feature training data."
        assert self.a_ is not None and self.b_ is not None, \
            "must fit before predict!"
        return np.array([self._predict(x) for x in x_predict])

    def _predict(self, x_single):
        """Given a single input x_single, return its predicted value"""
        return self.a_ * x_single + self.b_

    def __repr__(self):
        return "SimpleLinearRegression2()"
from playML.SimpleLinearRegression import SimpleLinearRegression2
reg2 = SimpleLinearRegression2()
reg2.fit(x, y)
reg2.predict(np.array([x_predict]))
array([ 5.2])
reg2.a_
0.80000000000000004
reg2.b_
0.39999999999999947
Performance comparison of the vectorized implementation
m = 1000000
big_x = np.random.random(size=m)
big_y = big_x * 2 + 3 + np.random.normal(size=m)
%timeit reg1.fit(big_x, big_y)
%timeit reg2.fit(big_x, big_y)
1 loop, best of 3: 984 ms per loop
100 loops, best of 3: 18.7 ms per loop
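The vectorized implementation is roughly 50 times faster here (about 18.7 ms versus 984 ms on one million samples), and both versions recover essentially the same parameters: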
reg1.a_
1.9998479120324177
reg1.b_
2.9989427131166595
reg2.a_
1.9998479120324153
reg2.b_
2.9989427131166604
1.3. Metrics for Evaluating Linear Regression
Evaluation criteria
A natural criterion is the sum of squared errors on the test set. That sum, however, depends on m: a larger test set tends to accumulate a larger total error, even though a model trained and evaluated on more data is undoubtedly preferable, so we need a measure that removes the dependence on the number of samples, as follows
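Dividing the sum of squared errors by the number of samples m gives the mean squared error (MSE); taking its square root gives the RMSE, and replacing squares with absolute values gives the MAE:
$$
\text{MSE} = \frac{1}{m}\sum_{i=1}^{m}\left(y_{test}^{(i)} - \hat{y}_{test}^{(i)}\right)^{2}, \qquad
\text{RMSE} = \sqrt{\text{MSE}}, \qquad
\text{MAE} = \frac{1}{m}\sum_{i=1}^{m}\left|y_{test}^{(i)} - \hat{y}_{test}^{(i)}\right|
$$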
A drawback of MSE is that its units are wrong: if y is measured in, say, tens of thousands of yuan, the squared error is in tens of thousands of yuan squared, which can be awkward to interpret.
RMSE sums the squares first and then takes the square root, so its units match those of y. If some predictions are very far from the true values, RMSE becomes comparatively large: it tends to amplify the largest errors, whereas MAE does not and simply reflects the average gap between predictions and true values. For that reason, driving RMSE down is usually the more meaningful goal, since a small RMSE means the worst errors over the whole sample are relatively small; moreover, the quantity inside the RMSE square root (the 1/m sum of squared errors) is essentially the objective we minimized during training, so optimizing the training objective and optimizing RMSE are consistent.
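A small illustration (with made-up error values) of why RMSE reacts more strongly than MAE to a few large errors:
import numpy as np
## two hypothetical error vectors with the same MAE; the second contains one large outlier
errors_even    = np.array([4.0, 4.0, 4.0, 4.0])
errors_outlier = np.array([1.0, 1.0, 1.0, 13.0])
for e in (errors_even, errors_outlier):
    mae = np.mean(np.abs(e))            ## both give 4.0
    rmse = np.sqrt(np.mean(e ** 2))     ## 4.0 vs about 6.56: RMSE amplifies the outlier
    print(mae, rmse)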
Evaluating regression: MSE vs MAE
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
The Boston housing dataset
boston = datasets.load_boston()
boston.keys()
dict_keys(['data', 'target', 'feature_names', 'DESCR'])
print(boston.DESCR)
Boston House Prices dataset
===========================
Notes
------
Data Set Characteristics:
:Number of Instances: 506
:Number of Attributes: 13 numeric/categorical predictive
:Median Value (attribute 14) is usually the target
:Attribute Information (in order):
- CRIM per capita crime rate by town
- ZN proportion of residential land zoned for lots over 25,000 sq.ft.
- INDUS proportion of non-retail business acres per town
- CHAS Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
- NOX nitric oxides concentration (parts per 10 million)
- RM average number of rooms per dwelling
- AGE proportion of owner-occupied units built prior to 1940
- DIS weighted distances to five Boston employment centres
- RAD index of accessibility to radial highways
- TAX full-value property-tax rate per $10,000
- PTRATIO pupil-teacher ratio by town
- B 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
- LSTAT % lower status of the population
- MEDV Median value of owner-occupied homes in $1000's
:Missing Attribute Values: None
:Creator: Harrison, D. and Rubinfeld, D.L.
This is a copy of UCI ML housing dataset.
http://archive.ics.uci.edu/ml/datasets/Housing
This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.
The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic
prices and the demand for clean air', J. Environ. Economics & Management,
vol.5, 81-102, 1978. Used in Belsley, Kuh & Welsch, 'Regression diagnostics
...', Wiley, 1980. N.B. Various transformations are used in the table on
pages 244-261 of the latter.
The Boston house-price data has been used in many machine learning papers that address regression
problems.
**References**
- Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
- Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.
- many more! (see http://archive.ics.uci.edu/ml/datasets/Housing)
boston.feature_names
array(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD',
'TAX', 'PTRATIO', 'B', 'LSTAT'],
dtype='<U7')
x = boston.data[:,5] ## use only the number-of-rooms feature (RM)
x.shape
(506,)
y = boston.target
y.shape
(506,)
plt.scatter(x, y)
plt.show()
np.max(y)
50.0
x = x[y < 50.0]
y = y[y < 50.0]
x.shape
(490,)
y.shape
(490,)
plt.scatter(x, y)
plt.show()
Using simple linear regression
from playML.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, seed=666)
x_train.shape
(392,)
y_train.shape
(392,)
x_test.shape
(98,)
y_test.shape
(98,)
from playML.SimpleLinearRegression import SimpleLinearRegression
reg = SimpleLinearRegression()
reg.fit(x_train, y_train)
SimpleLinearRegression()
reg.a_
7.8608543562689555
reg.b_
-27.459342806705543
plt.scatter(x_train, y_train)
plt.plot(x_train, reg.predict(x_train), color='r')
plt.show()
plt.scatter(x_train, y_train)
plt.scatter(x_test, y_test, color="c")
plt.plot(x_train, reg.predict(x_train), color='r')
plt.show()
y_predict = reg.predict(x_test)
MSE
mse_test = np.sum((y_predict - y_test)**2) / len(y_test)
mse_test
24.156602134387438
RMSE
from math import sqrt
rmse_test = sqrt(mse_test)
rmse_test
4.914936635846635
MAE
mae_test = np.sum(np.absolute(y_predict - y_test))/len(y_test)
mae_test
3.5430974409463873
Encapsulating our own evaluation functions
Code (playML/metrics.py):
import numpy as np
from math import sqrt

def accuracy_score(y_true, y_predict):
    """Compute the classification accuracy between y_true and y_predict"""
    assert len(y_true) == len(y_predict), \
        "the size of y_true must be equal to the size of y_predict"
    return np.sum(y_true == y_predict) / len(y_true)

def mean_squared_error(y_true, y_predict):
    """Compute the MSE between y_true and y_predict"""
    assert len(y_true) == len(y_predict), \
        "the size of y_true must be equal to the size of y_predict"
    return np.sum((y_true - y_predict)**2) / len(y_true)

def root_mean_squared_error(y_true, y_predict):
    """Compute the RMSE between y_true and y_predict"""
    return sqrt(mean_squared_error(y_true, y_predict))

def mean_absolute_error(y_true, y_predict):
    """Compute the MAE between y_true and y_predict"""
    return np.sum(np.absolute(y_true - y_predict)) / len(y_true)
from playML.metrics import mean_squared_error
from playML.metrics import root_mean_squared_error
from playML.metrics import mean_absolute_error
mean_squared_error(y_test, y_predict)
24.156602134387438
root_mean_squared_error(y_test, y_predict)
4.914936635846635
mean_absolute_error(y_test, y_predict)
3.5430974409463873
MSE and MAE in scikit-learn
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
mean_squared_error(y_test, y_predict)
24.156602134387438
mean_absolute_error(y_test, y_predict)
3.5430974409463873
1.4. The Best Metric for Linear Regression: R Squared
Limitations of RMSE and MAE
Suppose we predict house prices with an RMSE (or MAE) of 5, and student exam scores with an error of 10. The 5 and the 10 cannot be compared directly, because they are attached to different units and scales, so they tell us nothing about which model is better.
The solution: an introduction to R Squared
The meaning of R Squared
The baseline model (always predicting the mean ȳ) produces a large error, while our model produces a relatively small error, because it actually accounts for the relationship between x and y. The ratio of our model's error to the baseline's error is the share of error our model still makes, and 1 minus that ratio is the share of the variation our model has successfully explained.
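In formula form, with the baseline model predicting ȳ for every sample:
$$
R^{2} = 1 - \frac{\sum_{i}\left(\hat{y}^{(i)} - y^{(i)}\right)^{2}}{\sum_{i}\left(\bar{y} - y^{(i)}\right)^{2}} = 1 - \frac{\text{MSE}(\hat{y}, y)}{\text{Var}(y)}
$$
which is exactly the expression computed below.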
實(shí)現(xiàn) R Squared (R^2)
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
boston = datasets.load_boston()
x = boston.data[:,5] ## use only the number-of-rooms feature (RM)
y = boston.target
x = x[y < 50.0]
y = y[y < 50.0]
from playML.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, seed=666)
from playML.SimpleLinearRegression import SimpleLinearRegression
reg = SimpleLinearRegression()
reg.fit(x_train, y_train)
SimpleLinearRegression()
reg.a_
7.8608543562689555
reg.b_
-27.459342806705543
y_predict = reg.predict(x_test)
R Square
from playML.metrics import mean_squared_error
1 - mean_squared_error(y_test, y_predict)/np.var(y_test)
0.61293168039373225
Encapsulating our own R Square score
Code (playML/metrics.py)
def r2_score(y_true, y_predict):
    """Compute the R Square between y_true and y_predict"""
    return 1 - mean_squared_error(y_true, y_predict) / np.var(y_true)
from playML.metrics import r2_score
r2_score(y_test, y_predict)
0.61293168039373225
r2_score in scikit-learn
from sklearn.metrics import r2_score
r2_score(y_test, y_predict)
0.61293168039373236
The score method of LinearRegression in scikit-learn returns the r2_score: http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html
Adding score to our own SimpleLinearRegression
import numpy as np
from .metrics import r2_score

class SimpleLinearRegression:

    ## (__init__, fit, predict and _predict as defined above)

    def score(self, x_test, y_test):
        """Evaluate the current model on the test set x_test, y_test and return its R Square"""
        y_predict = self.predict(x_test)
        return r2_score(y_test, y_predict)
reg.score(x_test, y_test)
0.61293168039373225
1.5. Multiple Linear Regression
Introduction to multiple linear regression and the normal equation solution
Note: in matrix multiplication, A (m rows) times B (n columns) multiplies each row of A with each column of B and sums the products, giving an m × n result.
Note: a 1×m row vector times an m×1 column vector is a single number.
Derivation of the multiple linear regression formula
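Stacking the training samples into a matrix X_b (with a leading column of ones for the intercept θ0), the predictions are ŷ = X_b·θ, and minimizing the sum of squared errors gives the normal-equation solution used by fit_normal below:
$$
\hat{y} = X_b \theta, \qquad \theta = \left(X_b^{T} X_b\right)^{-1} X_b^{T} y
$$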
Implementing multiple linear regression
Implementing our own Linear Regression
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
boston = datasets.load_boston()
X = boston.data
y = boston.target
X = X[y < 50.0]
y = y[y < 50.0]
X.shape
(490, 13)
from playML.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, seed=666)
Using our own Linear Regression
Code: playML/LinearRegression.py
import numpy as np
from .metrics import r2_score

class LinearRegression:

    def __init__(self):
        """Initialize the Linear Regression model"""
        ## coefficient vector (θ1, θ2, ..., θn)
        self.coef_ = None
        ## intercept (θ0)
        self.intercept_ = None
        ## the full θ vector
        self._theta = None

    def fit_normal(self, X_train, y_train):
        """Fit the Linear Regression model on X_train, y_train using the normal equation"""
        assert X_train.shape[0] == y_train.shape[0], \
            "the size of X_train must be equal to the size of y_train"
        ## np.ones((len(X_train), 1)) builds a single column of ones with the same number of rows as X_train
        ## np.hstack stacks it with X_train to form X_b
        X_b = np.hstack([np.ones((len(X_train), 1)), X_train])
        ## X_b.T is the transpose, np.linalg.inv() the matrix inverse, dot() matrix multiplication
        self._theta = np.linalg.inv(X_b.T.dot(X_b)).dot(X_b.T).dot(y_train)
        self.intercept_ = self._theta[0]
        self.coef_ = self._theta[1:]
        return self

    def predict(self, X_predict):
        """Given a data set X_predict, return the vector of predicted values"""
        assert self.coef_ is not None and self.intercept_ is not None, \
            "must fit before predict"
        assert X_predict.shape[1] == len(self.coef_), \
            "the feature number of X_predict must be equal to X_train"
        X_b = np.hstack([np.ones((len(X_predict), 1)), X_predict])
        return X_b.dot(self._theta)

    def score(self, X_test, y_test):
        """Evaluate the current model on the test set X_test, y_test using R Square"""
        y_predict = self.predict(X_test)
        return r2_score(y_test, y_predict)

    def __repr__(self):
        return "LinearRegression()"
from playML.LinearRegression import LinearRegression
reg = LinearRegression()
reg.fit_normal(X_train, y_train)
LinearRegression()
reg.coef_
array([ -1.18919477e-01, 3.63991462e-02, -3.56494193e-02,
5.66737830e-02, -1.16195486e+01, 3.42022185e+00,
-2.31470282e-02, -1.19509560e+00, 2.59339091e-01,
-1.40112724e-02, -8.36521175e-01, 7.92283639e-03,
-3.81966137e-01])
reg.intercept_
34.161435496224712
reg.score(X_test, y_test)
0.81298026026584658
Regression problems in scikit-learn
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
boston = datasets.load_boston()
X = boston.data
y = boston.target
X = X[y < 50.0]
y = y[y < 50.0]
X.shape
(490, 13)
from playML.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, seed=666)
Linear regression in scikit-learn
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
lin_reg.coef_
array([ -1.18919477e-01, 3.63991462e-02, -3.56494193e-02,
5.66737830e-02, -1.16195486e+01, 3.42022185e+00,
-2.31470282e-02, -1.19509560e+00, 2.59339091e-01,
-1.40112724e-02, -8.36521175e-01, 7.92283639e-03,
-3.81966137e-01])
lin_reg.intercept_
34.161435496246924
lin_reg.score(X_test, y_test)
0.81298026026584758
kNN Regressor
from sklearn.preprocessing import StandardScaler
standardScaler = StandardScaler()
standardScaler.fit(X_train, y_train)
X_train_standard = standardScaler.transform(X_train)
X_test_standard = standardScaler.transform(X_test)
from sklearn.neighbors import KNeighborsRegressor
knn_reg = KNeighborsRegressor()
knn_reg.fit(X_train_standard, y_train)
knn_reg.score(X_test_standard, y_test)
0.84664511530389497
from sklearn.model_selection import GridSearchCV
param_grid = [
    {
        "weights": ["uniform"],
        "n_neighbors": [i for i in range(1, 11)]
    },
    {
        "weights": ["distance"],
        "n_neighbors": [i for i in range(1, 11)],
        "p": [i for i in range(1, 6)]
    }
]
knn_reg = KNeighborsRegressor()
grid_search = GridSearchCV(knn_reg, param_grid, n_jobs=-1, verbose=1)
grid_search.fit(X_train_standard, y_train)
Fitting 3 folds for each of 60 candidates, totalling 180 fits
[Parallel(n_jobs=-1)]: Done 180 out of 180 | elapsed: 1.5s finished
GridSearchCV(cv=None, error_score='raise',
estimator=KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
metric_params=None, n_jobs=1, n_neighbors=5, p=2,
weights='uniform'),
fit_params={}, iid=True, n_jobs=-1,
param_grid=[{'weights': ['uniform'], 'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]}, {'weights': ['distance'], 'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 'p': [1, 2, 3, 4, 5]}],
pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
scoring=None, verbose=1)
grid_search.best_params_
{'n_neighbors': 5, 'p': 1, 'weights': 'distance'}
grid_search.best_score_
0.79917999890996905
grid_search.best_estimator_.score(X_test_standard, y_test)
0.88099665099417701
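Note that grid_search.best_score_ is the mean cross-validated R Square on the training folds, so it is not directly comparable with the test-set scores: on the same test set, the tuned kNN regressor (about 0.881) beats both the default kNN (about 0.847) and the linear regression (about 0.813).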
Interpretability of linear regression parameters
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
boston = datasets.load_boston()
X = boston.data
y = boston.target
X = X[y < 50.0]
y = y[y < 50.0]
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(X, y)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
lin_reg.coef_
array([ -1.05574295e-01, 3.52748549e-02, -4.35179251e-02,
4.55405227e-01, -1.24268073e+01, 3.75411229e+00,
-2.36116881e-02, -1.21088069e+00, 2.50740082e-01,
-1.37702943e-02, -8.38888137e-01, 7.93577159e-03,
-3.50952134e-01])
np.argsort(lin_reg.coef_)
array([ 4, 7, 10, 12, 0, 2, 6, 9, 11, 1, 8, 3, 5])
boston.feature_names[np.argsort(lin_reg.coef_)]
array(['NOX', 'DIS', 'PTRATIO', 'LSTAT', 'CRIM', 'INDUS', 'AGE', 'TAX',
'B', 'ZN', 'RAD', 'CHAS', 'RM'],
dtype='<U7')
print(boston.DESCR)
Boston House Prices dataset
===========================
Notes
------
Data Set Characteristics:
:Number of Instances: 506
:Number of Attributes: 13 numeric/categorical predictive
:Median Value (attribute 14) is usually the target
:Attribute Information (in order):
- CRIM per capita crime rate by town
- ZN proportion of residential land zoned for lots over 25,000 sq.ft.
- INDUS proportion of non-retail business acres per town
- CHAS Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
- NOX nitric oxides concentration (parts per 10 million)
- RM average number of rooms per dwelling
- AGE proportion of owner-occupied units built prior to 1940
- DIS weighted distances to five Boston employment centres
- RAD index of accessibility to radial highways
- TAX full-value property-tax rate per $10,000
- PTRATIO pupil-teacher ratio by town
- B 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
- LSTAT % lower status of the population
- MEDV Median value of owner-occupied homes in $1000's
:Missing Attribute Values: None
:Creator: Harrison, D. and Rubinfeld, D.L.
This is a copy of UCI ML housing dataset.
http://archive.ics.uci.edu/ml/datasets/Housing
This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.
The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic
prices and the demand for clean air', J. Environ. Economics & Management,
vol.5, 81-102, 1978. Used in Belsley, Kuh & Welsch, 'Regression diagnostics
...', Wiley, 1980. N.B. Various transformations are used in the table on
pages 244-261 of the latter.
The Boston house-price data has been used in many machine learning papers that address regression
problems.
**References**
- Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
- Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.
- many more! (see http://archive.ics.uci.edu/ml/datasets/Housing)
RM is the number of rooms and has the largest positive coefficient: the more rooms, the higher the price, which is very reasonable. NOX is the nitric oxides concentration and has the most negative coefficient: the higher the concentration, the lower the price, which is also very reasonable. This shows that linear regression is interpretable: when studying a new problem we can first fit a linear regression model, look at its coefficients, and use this kind of intuition to judge whether the result matches our expectations.