- 1.1. Introduction to the linear regression algorithm
- 1.2. Deriving simple linear regression with least squares
- 1.3. Metrics for evaluating linear regression
- 1.4. The best metric for linear regression: R Squared
- 1.5. Multiple linear regression
1.1. Introduction to the Linear Regression Algorithm
Linear regression places samples in a coordinate system in which one dimension is the output and the other dimensions are the features (for example, in a two-dimensional plane the horizontal axis is the feature and the vertical axis is the output). When a large number of training samples are plotted this way, they turn out to be distributed around a straight line. What linear regression looks for is the straight line that best "fits" the relationship between the sample features and the sample output labels.
#### A linear regression problem with only one sample feature is called simple linear regression, e.g., house price vs. house area
With the feature on the x-axis and the output on the y-axis, every sample is a point (x(i), y(i)). The line we are looking for is y = ax + b, and for a new point x(j) the prediction is ŷ(j) = a·x(j) + b. To fit the line we need a way to measure the gap between the true value y(i) and the predicted value ŷ(i):
- We do not use the plain difference y(i) − ŷ(i), because positive and negative differences cancel each other out
- We do not use the absolute value |y(i) − ŷ(i)|, because the absolute-value function is not differentiable everywhere
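So we use the squared difference instead: simple linear regression looks for the a and b that minimize the sum of squared errors
$$
J(a, b) = \sum_{i=1}^{m} \left( y^{(i)} - a x^{(i)} - b \right)^2
$$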
#### From the reasoning above we can summarize the basic workflow of a whole class of machine learning algorithms: define a loss function that measures the gap between the true values and the predicted values and make that loss as small as possible, or define a utility function that measures the goodness of fit and make that utility as large as possible
1.2. Deriving Simple Linear Regression with Least Squares
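Setting the partial derivatives of J(a, b) with respect to a and b to zero (the least-squares derivation) gives the closed-form solution that the code below implements:
$$
a = \frac{\sum_{i=1}^{m} \left(x^{(i)} - \bar{x}\right)\left(y^{(i)} - \bar{y}\right)}{\sum_{i=1}^{m} \left(x^{(i)} - \bar{x}\right)^{2}}, \qquad b = \bar{y} - a\,\bar{x}
$$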
實(shí)現(xiàn)簡(jiǎn)單線性回歸法
import numpy as np
import matplotlib.pyplot as plt
x = np.array([1., 2., 3., 4., 5.])
y = np.array([1., 3., 2., 3., 5.])
plt.scatter(x, y)
plt.axis([0, 6, 0, 6])
plt.show()
x_mean = np.mean(x)
y_mean = np.mean(y)
num = 0.0
d = 0.0
for x_i, y_i in zip(x, y):
    num += (x_i - x_mean) * (y_i - y_mean)
    d += (x_i - x_mean) ** 2
a = num/d
b = y_mean - a * x_mean
y_hat = a * x + b
plt.scatter(x, y)
plt.plot(x, y_hat, color='r')
plt.axis([0, 6, 0, 6])
plt.show()
x_predict = 6
y_predict = a * x_predict + b
y_predict
5.2000000000000002
Encapsulating our own SimpleLinearRegression
Code: SimpleLinearRegression.py
import numpy as np

class SimpleLinearRegression1:

    def __init__(self):
        """Initialize the Simple Linear Regression model"""
        self.a_ = None
        self.b_ = None

    def fit(self, x_train, y_train):
        """Fit the Simple Linear Regression model on the training set x_train, y_train"""
        assert x_train.ndim == 1, \
            "Simple Linear Regression can only solve single feature training data"
        assert len(x_train) == len(y_train), \
            "the size of x_train must be equal to the size of y_train"
        ## means
        x_mean = x_train.mean()
        y_mean = y_train.mean()
        ## numerator
        num = 0.0
        ## denominator
        d = 0.0
        ## accumulate numerator and denominator
        for x_i, y_i in zip(x_train, y_train):
            num += (x_i - x_mean) * (y_i - y_mean)
            d += (x_i - x_mean) ** 2
        ## compute the parameters a and b
        self.a_ = num / d
        self.b_ = y_mean - self.a_ * x_mean
        return self

    def predict(self, x_predict):
        """Given a set of inputs x_predict, return the corresponding predicted values"""
        assert x_predict.ndim == 1, \
            "Simple Linear Regression can only solve single feature training data"
        assert self.a_ is not None and self.b_ is not None, \
            "must fit before predict!"
        return np.array([self._predict(x) for x in x_predict])

    def _predict(self, x_single):
        """Given a single input x_single, return its predicted value"""
        return self.a_ * x_single + self.b_

    def __repr__(self):
        return "SimpleLinearRegression1()"
from playML.SimpleLinearRegression import SimpleLinearRegression1
reg1 = SimpleLinearRegression1()
reg1.fit(x, y)
reg1.predict(np.array([x_predict]))
array([ 5.2])
reg1.a_
0.80000000000000004
reg1.b_
0.39999999999999947
y_hat1 = reg1.predict(x)
plt.scatter(x, y)
plt.plot(x, y_hat1, color='r')
plt.axis([0, 6, 0, 6])
plt.show()
Vectorization
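Both the numerator and the denominator of a are sums of element-wise products, so they can be rewritten as dot products of the de-meaned vectors, which NumPy evaluates far faster than a Python loop:
$$
a = \frac{(x - \bar{x}) \cdot (y - \bar{y})}{(x - \bar{x}) \cdot (x - \bar{x})}
$$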
Vectorized implementation of SimpleLinearRegression
Code: SimpleLinearRegression.py
import numpy as np

class SimpleLinearRegression2:

    def __init__(self):
        """Initialize the Simple Linear Regression model"""
        self.a_ = None
        self.b_ = None

    def fit(self, x_train, y_train):
        """Fit the Simple Linear Regression model on the training set x_train, y_train"""
        assert x_train.ndim == 1, \
            "Simple Linear Regressor can only solve single feature training data."
        assert len(x_train) == len(y_train), \
            "the size of x_train must be equal to the size of y_train"
        x_mean = np.mean(x_train)
        y_mean = np.mean(y_train)
        ## vectorized: numerator and denominator as dot products of the de-meaned vectors
        self.a_ = (x_train - x_mean).dot(y_train - y_mean) / (x_train - x_mean).dot(x_train - x_mean)
        self.b_ = y_mean - self.a_ * x_mean
        return self

    def predict(self, x_predict):
        """Given a set of inputs x_predict, return the vector of predicted values"""
        assert x_predict.ndim == 1, \
            "Simple Linear Regressor can only solve single feature training data."
        assert self.a_ is not None and self.b_ is not None, \
            "must fit before predict!"
        return np.array([self._predict(x) for x in x_predict])

    def _predict(self, x_single):
        """Given a single input x_single, return its predicted value"""
        return self.a_ * x_single + self.b_

    def __repr__(self):
        return "SimpleLinearRegression2()"
from playML.SimpleLinearRegression import SimpleLinearRegression2
reg2 = SimpleLinearRegression2()
reg2.fit(x, y)
reg2.predict(np.array([x_predict]))
array([ 5.2])
reg2.a_
0.80000000000000004
reg2.b_
0.39999999999999947
Performance comparison of the vectorized implementation
m = 1000000
big_x = np.random.random(size=m)
big_y = big_x * 2 + 3 + np.random.normal(size=m)
%timeit reg1.fit(big_x, big_y)
%timeit reg2.fit(big_x, big_y)
1 loop, best of 3: 984 ms per loop
100 loops, best of 3: 18.7 ms per loop
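The vectorized implementation is roughly 50 times faster here (about 18.7 ms versus 984 ms on one million samples), and both versions recover essentially the same parameters: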
reg1.a_
1.9998479120324177
reg1.b_
2.9989427131166595
reg2.a_
1.9998479120324153
reg2.b_
2.9989427131166604
1.3. Metrics for Evaluating Linear Regression
Evaluation criteria
A natural criterion is the sum of squared errors on the test set. That sum, however, depends on m: a larger test set tends to accumulate a larger total error, even though a model trained and evaluated on more data is undoubtedly preferable, so we need a measure that removes the dependence on the number of samples, as follows
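Dividing the sum of squared errors by the number of samples m gives the mean squared error (MSE); taking its square root gives the RMSE, and replacing squares with absolute values gives the MAE:
$$
\text{MSE} = \frac{1}{m}\sum_{i=1}^{m}\left(y_{test}^{(i)} - \hat{y}_{test}^{(i)}\right)^{2}, \qquad
\text{RMSE} = \sqrt{\text{MSE}}, \qquad
\text{MAE} = \frac{1}{m}\sum_{i=1}^{m}\left|y_{test}^{(i)} - \hat{y}_{test}^{(i)}\right|
$$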
A drawback of MSE is that its units are wrong: if y is measured in, say, tens of thousands of yuan, the squared error is in tens of thousands of yuan squared, which can be awkward to interpret.
RMSE sums the squares first and then takes the square root, so its units match those of y. If some predictions are very far from the true values, RMSE becomes comparatively large: it tends to amplify the largest errors, whereas MAE does not and simply reflects the average gap between predictions and true values. For that reason, driving RMSE down is usually the more meaningful goal, since a small RMSE means the worst errors over the whole sample are relatively small; moreover, the quantity inside the RMSE square root (the 1/m sum of squared errors) is essentially the objective we minimized during training, so optimizing the training objective and optimizing RMSE are consistent.
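A small illustration (with made-up error values) of why RMSE reacts more strongly than MAE to a few large errors:
import numpy as np
## two hypothetical error vectors with the same MAE; the second contains one large outlier
errors_even    = np.array([4.0, 4.0, 4.0, 4.0])
errors_outlier = np.array([1.0, 1.0, 1.0, 13.0])
for e in (errors_even, errors_outlier):
    mae = np.mean(np.abs(e))            ## both give 4.0
    rmse = np.sqrt(np.mean(e ** 2))     ## 4.0 vs about 6.56: RMSE amplifies the outlier
    print(mae, rmse)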
Evaluating regression: MSE vs MAE
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
The Boston housing dataset
boston = datasets.load_boston()
boston.keys()
dict_keys(['data', 'target', 'feature_names', 'DESCR'])
print(boston.DESCR)
Boston House Prices dataset
===========================
Notes
------
Data Set Characteristics:
:Number of Instances: 506
:Number of Attributes: 13 numeric/categorical predictive
:Median Value (attribute 14) is usually the target
:Attribute Information (in order):
- CRIM per capita crime rate by town
- ZN proportion of residential land zoned for lots over 25,000 sq.ft.
- INDUS proportion of non-retail business acres per town
- CHAS Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
- NOX nitric oxides concentration (parts per 10 million)
- RM average number of rooms per dwelling
- AGE proportion of owner-occupied units built prior to 1940
- DIS weighted distances to five Boston employment centres
- RAD index of accessibility to radial highways
- TAX full-value property-tax rate per $10,000
- PTRATIO pupil-teacher ratio by town
- B 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
- LSTAT % lower status of the population
- MEDV Median value of owner-occupied homes in $1000's
:Missing Attribute Values: None
:Creator: Harrison, D. and Rubinfeld, D.L.
This is a copy of UCI ML housing dataset.
http://archive.ics.uci.edu/ml/datasets/Housing
This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.
The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic
prices and the demand for clean air', J. Environ. Economics & Management,
vol.5, 81-102, 1978. Used in Belsley, Kuh & Welsch, 'Regression diagnostics
...', Wiley, 1980. N.B. Various transformations are used in the table on
pages 244-261 of the latter.
The Boston house-price data has been used in many machine learning papers that address regression
problems.
**References**
- Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
- Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.
- many more! (see http://archive.ics.uci.edu/ml/datasets/Housing)
boston.feature_names
array(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD',
'TAX', 'PTRATIO', 'B', 'LSTAT'],
dtype='<U7')
x = boston.data[:,5] ## use only the number-of-rooms feature (RM)
x.shape
(506,)
y = boston.target
y.shape
(506,)
plt.scatter(x, y)
plt.show()
np.max(y)
50.0
x = x[y < 50.0]
y = y[y < 50.0]
x.shape
(490,)
y.shape
(490,)
plt.scatter(x, y)
plt.show()
Using simple linear regression
from playML.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, seed=666)
x_train.shape
(392,)
y_train.shape
(392,)
x_test.shape
(98,)
y_test.shape
(98,)
from playML.SimpleLinearRegression import SimpleLinearRegression
reg = SimpleLinearRegression()
reg.fit(x_train, y_train)
SimpleLinearRegression()
reg.a_
7.8608543562689555
reg.b_
-27.459342806705543
plt.scatter(x_train, y_train)
plt.plot(x_train, reg.predict(x_train), color='r')
plt.show()
plt.scatter(x_train, y_train)
plt.scatter(x_test, y_test, color="c")
plt.plot(x_train, reg.predict(x_train), color='r')
plt.show()
y_predict = reg.predict(x_test)
MSE
mse_test = np.sum((y_predict - y_test)**2) / len(y_test)
mse_test
24.156602134387438
RMSE
from math import sqrt
rmse_test = sqrt(mse_test)
rmse_test
4.914936635846635
MAE
mae_test = np.sum(np.absolute(y_predict - y_test))/len(y_test)
mae_test
3.5430974409463873
Encapsulating our own evaluation functions
Code (playML/metrics.py):
import numpy as np
from math import sqrt

def accuracy_score(y_true, y_predict):
    """Compute the classification accuracy between y_true and y_predict"""
    assert len(y_true) == len(y_predict), \
        "the size of y_true must be equal to the size of y_predict"
    return np.sum(y_true == y_predict) / len(y_true)

def mean_squared_error(y_true, y_predict):
    """Compute the MSE between y_true and y_predict"""
    assert len(y_true) == len(y_predict), \
        "the size of y_true must be equal to the size of y_predict"
    return np.sum((y_true - y_predict)**2) / len(y_true)

def root_mean_squared_error(y_true, y_predict):
    """Compute the RMSE between y_true and y_predict"""
    return sqrt(mean_squared_error(y_true, y_predict))

def mean_absolute_error(y_true, y_predict):
    """Compute the MAE between y_true and y_predict"""
    return np.sum(np.absolute(y_true - y_predict)) / len(y_true)
from playML.metrics import mean_squared_error
from playML.metrics import root_mean_squared_error
from playML.metrics import mean_absolute_error
mean_squared_error(y_test, y_predict)
24.156602134387438
root_mean_squared_error(y_test, y_predict)
4.914936635846635
mean_absolute_error(y_test, y_predict)
3.5430974409463873
MSE and MAE in scikit-learn
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
mean_squared_error(y_test, y_predict)
24.156602134387438
mean_absolute_error(y_test, y_predict)
3.5430974409463873
1.4. The Best Metric for Linear Regression: R Squared
Limitations of RMSE and MAE
Suppose we predict house prices with an RMSE (or MAE) of 5, and student exam scores with an error of 10. The 5 and the 10 cannot be compared directly, because they are attached to different units and scales, so they tell us nothing about which model is better.
The solution: an introduction to R Squared
The meaning of R Squared
The baseline model (always predicting the mean ȳ) produces a large error, while our model produces a relatively small error, because it actually accounts for the relationship between x and y. The ratio of our model's error to the baseline's error is the share of error our model still makes, and 1 minus that ratio is the share of the variation our model has successfully explained.
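In formula form, with the baseline model predicting ȳ for every sample:
$$
R^{2} = 1 - \frac{\sum_{i}\left(\hat{y}^{(i)} - y^{(i)}\right)^{2}}{\sum_{i}\left(\bar{y} - y^{(i)}\right)^{2}} = 1 - \frac{\text{MSE}(\hat{y}, y)}{\text{Var}(y)}
$$
which is exactly the expression computed below.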
實(shí)現(xiàn) R Squared (R^2)
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
boston = datasets.load_boston()
x = boston.data[:,5] ## use only the number-of-rooms feature (RM)
y = boston.target
x = x[y < 50.0]
y = y[y < 50.0]
from playML.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, seed=666)
from playML.SimpleLinearRegression import SimpleLinearRegression
reg = SimpleLinearRegression()
reg.fit(x_train, y_train)
SimpleLinearRegression()
reg.a_
7.8608543562689555
reg.b_
-27.459342806705543
y_predict = reg.predict(x_test)
R Square
from playML.metrics import mean_squared_error
1 - mean_squared_error(y_test, y_predict)/np.var(y_test)
0.61293168039373225
Encapsulating our own R Square score
Code (playML/metrics.py)
def r2_score(y_true, y_predict):
    """Compute the R Square between y_true and y_predict"""
    return 1 - mean_squared_error(y_true, y_predict) / np.var(y_true)
from playML.metrics import r2_score
r2_score(y_test, y_predict)
0.61293168039373225
r2_score in scikit-learn
from sklearn.metrics import r2_score
r2_score(y_test, y_predict)
0.61293168039373236
The score method of LinearRegression in scikit-learn returns the r2_score: http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html
Adding score to our own SimpleLinearRegression
import numpy as np
from .metrics import r2_score

class SimpleLinearRegression:

    ## (__init__, fit, predict and _predict as defined above)

    def score(self, x_test, y_test):
        """Evaluate the current model on the test set x_test, y_test and return its R Square"""
        y_predict = self.predict(x_test)
        return r2_score(y_test, y_predict)
reg.score(x_test, y_test)
0.61293168039373225
1.5. Multiple Linear Regression
Introduction to multiple linear regression and the normal equation solution
Note: in matrix multiplication, A (m rows) times B (n columns) multiplies each row of A with each column of B and sums the products, giving an m × n result.
Note: a 1×m row vector times an m×1 column vector is a single number.
Derivation of the multiple linear regression formula
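Stacking the training samples into a matrix X_b (with a leading column of ones for the intercept θ0), the predictions are ŷ = X_b·θ, and minimizing the sum of squared errors gives the normal-equation solution used by fit_normal below:
$$
\hat{y} = X_b \theta, \qquad \theta = \left(X_b^{T} X_b\right)^{-1} X_b^{T} y
$$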
Implementing multiple linear regression
Implementing our own Linear Regression
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
boston = datasets.load_boston()
X = boston.data
y = boston.target
X = X[y < 50.0]
y = y[y < 50.0]
X.shape
(490, 13)
from playML.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, seed=666)
Using our own Linear Regression
Code: playML/LinearRegression.py
import numpy as np
from .metrics import r2_score

class LinearRegression:

    def __init__(self):
        """Initialize the Linear Regression model"""
        ## coefficient vector (θ1, θ2, ..., θn)
        self.coef_ = None
        ## intercept (θ0)
        self.intercept_ = None
        ## the full θ vector
        self._theta = None

    def fit_normal(self, X_train, y_train):
        """Fit the Linear Regression model on X_train, y_train using the normal equation"""
        assert X_train.shape[0] == y_train.shape[0], \
            "the size of X_train must be equal to the size of y_train"
        ## np.ones((len(X_train), 1)) builds a single column of ones with the same number of rows as X_train
        ## np.hstack stacks it with X_train to form X_b
        X_b = np.hstack([np.ones((len(X_train), 1)), X_train])
        ## X_b.T is the transpose, np.linalg.inv() the matrix inverse, dot() matrix multiplication
        self._theta = np.linalg.inv(X_b.T.dot(X_b)).dot(X_b.T).dot(y_train)
        self.intercept_ = self._theta[0]
        self.coef_ = self._theta[1:]
        return self

    def predict(self, X_predict):
        """Given a data set X_predict, return the vector of predicted values"""
        assert self.coef_ is not None and self.intercept_ is not None, \
            "must fit before predict"
        assert X_predict.shape[1] == len(self.coef_), \
            "the feature number of X_predict must be equal to X_train"
        X_b = np.hstack([np.ones((len(X_predict), 1)), X_predict])
        return X_b.dot(self._theta)

    def score(self, X_test, y_test):
        """Evaluate the current model on the test set X_test, y_test using R Square"""
        y_predict = self.predict(X_test)
        return r2_score(y_test, y_predict)

    def __repr__(self):
        return "LinearRegression()"
from playML.LinearRegression import LinearRegression
reg = LinearRegression()
reg.fit_normal(X_train, y_train)
LinearRegression()
reg.coef_
array([ -1.18919477e-01, 3.63991462e-02, -3.56494193e-02,
5.66737830e-02, -1.16195486e+01, 3.42022185e+00,
-2.31470282e-02, -1.19509560e+00, 2.59339091e-01,
-1.40112724e-02, -8.36521175e-01, 7.92283639e-03,
-3.81966137e-01])
reg.intercept_
34.161435496224712
reg.score(X_test, y_test)
0.81298026026584658
Regression problems in scikit-learn
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
boston = datasets.load_boston()
X = boston.data
y = boston.target
X = X[y < 50.0]
y = y[y < 50.0]
X.shape
(490, 13)
from playML.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, seed=666)
Linear regression in scikit-learn
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
lin_reg.coef_
array([ -1.18919477e-01, 3.63991462e-02, -3.56494193e-02,
5.66737830e-02, -1.16195486e+01, 3.42022185e+00,
-2.31470282e-02, -1.19509560e+00, 2.59339091e-01,
-1.40112724e-02, -8.36521175e-01, 7.92283639e-03,
-3.81966137e-01])
lin_reg.intercept_
34.161435496246924
lin_reg.score(X_test, y_test)
0.81298026026584758
kNN Regressor
from sklearn.preprocessing import StandardScaler
standardScaler = StandardScaler()
standardScaler.fit(X_train, y_train)
X_train_standard = standardScaler.transform(X_train)
X_test_standard = standardScaler.transform(X_test)
from sklearn.neighbors import KNeighborsRegressor
knn_reg = KNeighborsRegressor()
knn_reg.fit(X_train_standard, y_train)
knn_reg.score(X_test_standard, y_test)
0.84664511530389497
from sklearn.model_selection import GridSearchCV
param_grid = [
    {
        "weights": ["uniform"],
        "n_neighbors": [i for i in range(1, 11)]
    },
    {
        "weights": ["distance"],
        "n_neighbors": [i for i in range(1, 11)],
        "p": [i for i in range(1, 6)]
    }
]
knn_reg = KNeighborsRegressor()
grid_search = GridSearchCV(knn_reg, param_grid, n_jobs=-1, verbose=1)
grid_search.fit(X_train_standard, y_train)
Fitting 3 folds for each of 60 candidates, totalling 180 fits
[Parallel(n_jobs=-1)]: Done 180 out of 180 | elapsed: 1.5s finished
GridSearchCV(cv=None, error_score='raise',
estimator=KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
metric_params=None, n_jobs=1, n_neighbors=5, p=2,
weights='uniform'),
fit_params={}, iid=True, n_jobs=-1,
param_grid=[{'weights': ['uniform'], 'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]}, {'weights': ['distance'], 'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 'p': [1, 2, 3, 4, 5]}],
pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
scoring=None, verbose=1)
grid_search.best_params_
{'n_neighbors': 5, 'p': 1, 'weights': 'distance'}
grid_search.best_score_
0.79917999890996905
grid_search.best_estimator_.score(X_test_standard, y_test)
0.88099665099417701
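Note that grid_search.best_score_ is the mean cross-validated R Square on the training folds, so it is not directly comparable with the test-set scores: on the same test set, the tuned kNN regressor (about 0.881) beats both the default kNN (about 0.847) and the linear regression (about 0.813).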
Interpretability of linear regression parameters
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
boston = datasets.load_boston()
X = boston.data
y = boston.target
X = X[y < 50.0]
y = y[y < 50.0]
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(X, y)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
lin_reg.coef_
array([ -1.05574295e-01, 3.52748549e-02, -4.35179251e-02,
4.55405227e-01, -1.24268073e+01, 3.75411229e+00,
-2.36116881e-02, -1.21088069e+00, 2.50740082e-01,
-1.37702943e-02, -8.38888137e-01, 7.93577159e-03,
-3.50952134e-01])
np.argsort(lin_reg.coef_)
array([ 4, 7, 10, 12, 0, 2, 6, 9, 11, 1, 8, 3, 5])
boston.feature_names[np.argsort(lin_reg.coef_)]
array(['NOX', 'DIS', 'PTRATIO', 'LSTAT', 'CRIM', 'INDUS', 'AGE', 'TAX',
'B', 'ZN', 'RAD', 'CHAS', 'RM'],
dtype='<U7')
print(boston.DESCR)
Boston House Prices dataset
===========================
Notes
------
Data Set Characteristics:
:Number of Instances: 506
:Number of Attributes: 13 numeric/categorical predictive
:Median Value (attribute 14) is usually the target
:Attribute Information (in order):
- CRIM per capita crime rate by town
- ZN proportion of residential land zoned for lots over 25,000 sq.ft.
- INDUS proportion of non-retail business acres per town
- CHAS Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
- NOX nitric oxides concentration (parts per 10 million)
- RM average number of rooms per dwelling
- AGE proportion of owner-occupied units built prior to 1940
- DIS weighted distances to five Boston employment centres
- RAD index of accessibility to radial highways
- TAX full-value property-tax rate per $10,000
- PTRATIO pupil-teacher ratio by town
- B 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
- LSTAT % lower status of the population
- MEDV Median value of owner-occupied homes in $1000's
:Missing Attribute Values: None
:Creator: Harrison, D. and Rubinfeld, D.L.
This is a copy of UCI ML housing dataset.
http://archive.ics.uci.edu/ml/datasets/Housing
This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.
The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic
prices and the demand for clean air', J. Environ. Economics & Management,
vol.5, 81-102, 1978. Used in Belsley, Kuh & Welsch, 'Regression diagnostics
...', Wiley, 1980. N.B. Various transformations are used in the table on
pages 244-261 of the latter.
The Boston house-price data has been used in many machine learning papers that address regression
problems.
**References**
- Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
- Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.
- many more! (see http://archive.ics.uci.edu/ml/datasets/Housing)
RM is the number of rooms and has the largest positive coefficient: the more rooms, the higher the price, which is very reasonable. NOX is the nitric oxides concentration and has the most negative coefficient: the higher the concentration, the lower the price, which is also very reasonable. This shows that linear regression is interpretable: when studying a new problem we can first fit a linear regression model, look at its coefficients, and use this kind of intuition to judge whether the result matches our expectations.