1. Introduction
For linear regression, one way to improve a model's ability to generalize is ridge regression (Ridge Regression), which adds a regularization term to the loss. Of course, the best remedy is still to gather more training samples, and we will see the effect of a larger sample size later. For convenience in later use, this time we wrap the ridge regression model in a class (class Ridge_Regression) following an object-oriented style.
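Concretely, ridge regression minimizes the usual least-squares loss plus an L2 penalty on the weights, with the penalty strength controlled by a coefficient alpha (the same alpha passed to sklearn's Ridge below):

\[ \min_{w}\ \lVert Xw - y \rVert_2^2 + \alpha\,\lVert w \rVert_2^2 \]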
2. Sample Generation
As before, we use sklearn's datasets module to generate the data; the generation is defined inside the __init__ method.
import numpy as np
from sklearn import datasets, linear_model
import sklearn.metrics as sm


class Ridge_Regression:
    def __init__(
        self,
        n_samples=50,
        n_features=10,
        n_informative=2,
        n_targets=5,
        noise=30,
        bias=10,
        random_state=1,
        x=None,
        y=None
    ):
        # Generate synthetic data unless the caller supplies x and y directly.
        if (x is None) and (y is None):
            self.X, self.y = datasets.make_regression(
                n_samples=n_samples,
                n_features=n_features,
                n_informative=n_informative,
                n_targets=n_targets,
                noise=noise,
                bias=bias,
                random_state=random_state
            )
            # make_regression returns a 1-D y when n_targets == 1;
            # reshape so y is always (n_samples, n_targets).
            self.y = self.y.reshape(n_samples, n_targets)
        elif (x is not None) and (y is not None):
            self.X = np.array(x)
            self.y = np.array(y)
        else:
            raise ValueError("Provide both x and y, or neither.")
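As a quick check, with the defaults above the generated matrices have the expected shapes:

regressor = Ridge_Regression()
print(regressor.X.shape)  # (50, 10): n_samples x n_features
print(regressor.y.shape)  # (50, 5): n_samples x n_targets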
3. Data Preprocessing
Shuffle the samples, then split the dataset into training and test sets.
    def preprocess(self, proportion=0.8):
        n = self.X.shape[0]  # total number of samples
        n_train = int(n * proportion)  # size of the training set
        # Shuffle X and y with the same random permutation.
        permutation = np.random.permutation(n)
        self.X, self.y = self.X[permutation], self.y[permutation]
        # First n_train rows for training, the rest for testing.
        self.X_train, self.y_train = self.X[:n_train, :], self.y[:n_train, :]
        self.X_test, self.y_test = self.X[n_train:, :], self.y[n_train:, :]
        return self.X_train, self.y_train, self.X_test, self.y_test
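Continuing the quick check above: a proportion of 0.75 (used in the test run below) splits the 50 default samples into 37 training rows and 13 test rows, since int(50 * 0.75) == 37.

X_train, y_train, X_test, y_test = regressor.preprocess(proportion=0.75)
print(X_train.shape, X_test.shape)  # (37, 10) (13, 10)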
4. Model Training
The parameter to watch here is alpha, the regularization coefficient. The larger alpha is, the stronger the regularization penalty; with alpha=0 the model reduces to ordinary linear regression.
    def train(self, alpha=0.01, fit_intercept=True, max_iter=10000):
        # alpha controls the strength of the L2 penalty.
        self.ridge_regressor = linear_model.Ridge(
            alpha=alpha, fit_intercept=fit_intercept, max_iter=max_iter
        )
        self.ridge_regressor.fit(self.X_train, self.y_train)
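To illustrate the alpha=0 remark above, here is a minimal sketch (outside the class) showing that Ridge with a vanishing penalty recovers essentially the same coefficients as ordinary least squares. A tiny alpha is used instead of exactly 0, which sklearn's documentation advises against for numerical reasons:

ols = linear_model.LinearRegression().fit(X_train, y_train)
near_ols = linear_model.Ridge(alpha=1e-8).fit(X_train, y_train)
print(np.allclose(ols.coef_, near_ols.coef_, atol=1e-4))  # expected: True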
5. Prediction and Evaluation
Same as with plain linear regression before, so we won't dwell on it.
    def predict(self, X):
        return self.ridge_regressor.predict(X)

    def loss(self, y_true, y_pred):
        # Mean squared error, rounded for readability.
        return round(sm.mean_squared_error(y_true, y_pred), 4)

    def variance(self, y_true, y_pred):
        # Explained variance score: 1.0 is perfect, lower is worse.
        return round(sm.explained_variance_score(y_true, y_pred), 4)
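For reference, sklearn's explained_variance_score is defined as

\[ \text{EV}(y, \hat{y}) = 1 - \frac{\operatorname{Var}(y - \hat{y})}{\operatorname{Var}(y)} \]

so a score close to 1 means the model accounts for almost all of the variance in y, and a score close to 0 means it explains almost none.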
6. Results
We test with alpha set to 0.01, 0.1, and 0.5.
if __name__ == "__main__":
    ridge_regressor = Ridge_Regression()
    X_train, y_train, X_test, y_test = ridge_regressor.preprocess(proportion=0.75)
    for alpha in [0.01, 0.1, 0.5]:
        ridge_regressor.train(alpha=alpha)
        y_predict_test = ridge_regressor.predict(X_test)
        print("alpha = {}, test loss: {}".format(
            round(alpha, 2), ridge_regressor.loss(y_test, y_predict_test)))
        print("alpha = {}, variance: {}\n".format(
            round(alpha, 2), ridge_regressor.variance(y_test, y_predict_test)))
The output is:
# n_samples = 50
alpha = 0.01, test loss: 1821.4612
alpha = 0.01, variance: 0.1471
alpha = 0.1, test loss: 1796.6773
alpha = 0.1, variance: 0.1571
alpha = 0.5, test loss: 1701.2975
alpha = 0.5, variance: 0.1966
As we can see, increasing alpha does reduce the test loss, but the model is still poor: the explained-variance score is very low, meaning the model accounts for only a small fraction of the variance in the targets. Next, let's increase the sample size from 50 to 5000 and see the effect.
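To reproduce this run, only the constructor call changes; everything else stays the same:

ridge_regressor = Ridge_Regression(n_samples=5000)

The output becomes: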
# n_samples = 5000
alpha = 0.01, test loss: 905.622
alpha = 0.01, variance: 0.6951
alpha = 0.1, test loss: 905.6208
alpha = 0.1, variance: 0.6951
alpha = 0.5, test loss: 905.6158
alpha = 0.5, variance: 0.6951
As we can see, the loss drops sharply and the variance score improves markedly; this gain is much larger than what ridge regularization itself delivered (though that may be because my generated data doesn't really play to ridge regression's strengths). In any case, increasing the sample size is without doubt a key way to improve model performance.