A decision tree can be used for classification as well as regression.
A splitting criterion is needed to decide which feature becomes the root node (and each subsequent split).
The higher the entropy, the more mixed the node; the lower the entropy, the purer and more stable it is.
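As a minimal sketch of that idea (the two-class counts below are hypothetical, not taken from the dataset that follows), entropy can be computed directly:

import numpy as np

def entropy(counts):
    # H = -sum(p * log2(p)) over the class proportions at a node
    p = np.asarray(counts, dtype = float)
    p = p / p.sum()
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

entropy([5, 5])   # maximally mixed two-class node -> 1.0
entropy([9, 1])   # nearly pure node               -> ~0.47
entropy([10, 0])  # pure node                      -> 0.0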
An example with sklearn
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import fetch_california_housing
housing = fetch_california_housing()
print(housing.DESCR)
downloading Cal. housing from http://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.tgz to C:\Users\user\scikit_learn_data
California housing dataset.
The original database is available from StatLib
http://lib.stat.cmu.edu/
The data contains 20,640 observations on 9 variables.
This dataset contains the average house value as target variable
and the following input variables (features): average income,
housing average age, average rooms, average bedrooms, population,
average occupation, latitude, and longitude in that order.
References
----------
Pace, R. Kelley and Ronald Barry, Sparse Spatial Autoregressions,
Statistics and Probability Letters, 33 (1997) 291-297.
housing.data.shape
(20640, 8)
housing.data[0]
array([ 8.3252 , 41. , 6.98412698, 1.02380952,
322. , 2.55555556, 37.88 , -122.23 ])
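To eyeball the raw features before modeling, the arrays can be wrapped in a DataFrame (pd is already imported above); this is purely an inspection aid:

pd.DataFrame(housing.data, columns = housing.feature_names).head()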
from sklearn import tree
dtr = tree.DecisionTreeRegressor(max_depth = 2)
# fit on latitude and longitude only (columns 6 and 7)
dtr.fit(housing.data[:, [6, 7]], housing.target)
DecisionTreeRegressor(criterion='mse', max_depth=2, max_features=None,
max_leaf_nodes=None, min_impurity_split=1e-07,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, presort=False, random_state=None,
splitter='best')
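The fitted depth-2 tree can already produce predictions; a quick sanity check on the first two rows (using the same two columns it was fit on):

dtr.predict(housing.data[:2, [6, 7]])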
# To visualize the tree, Graphviz must be installed first: http://www.graphviz.org/Download..php
dot_data = \
tree.export_graphviz(
dtr,
out_file = None,
feature_names = housing.feature_names[6:8],
filled = True,
impurity = False,
rounded = True
)
#pip install pydotplus
import pydotplus
graph = pydotplus.graph_from_dot_data(dot_data)
graph.get_nodes()[7].set_fillcolor("#FFF2DD")  # recolor one node (index 7) for emphasis
from IPython.display import Image
Image(graph.create_png())
graph.write_png("dtr_white_background.png")
True
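As an aside, scikit-learn 0.21+ ships tree.plot_tree, which renders the same picture with matplotlib alone and avoids the Graphviz/pydotplus dependency; a minimal sketch:

plt.figure(figsize = (10, 6))
tree.plot_tree(dtr, feature_names = housing.feature_names[6:8], filled = True, rounded = True)
plt.show()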
from sklearn.model_selection import train_test_split
data_train, data_test, target_train, target_test = \
train_test_split(housing.data, housing.target, test_size = 0.1, random_state = 42)
dtr = tree.DecisionTreeRegressor(random_state = 42)
dtr.fit(data_train, target_train)
dtr.score(data_test, target_test)
0.637318351331017
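(The value returned by DecisionTreeRegressor.score is the coefficient of determination R² on the held-out 10% test split.)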
from sklearn.ensemble import RandomForestRegressor
rfr = RandomForestRegressor( random_state = 42)
rfr.fit(data_train, target_train)
rfr.score(data_test, target_test)
0.79086492280964926
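Even with default settings, the random forest clearly beats the single unpruned tree (about 0.79 vs 0.64 R²) before any hyperparameter tuning.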
Tree model parameters (a configuration sketch follows this list):
1. criterion: gini or entropy.
2. splitter: best or random. The former searches all features for the best split point; the latter searches only a subset of the features (useful when the dataset is large).
3. max_features: None (all features), log2, sqrt, or N. With fewer than 50 features, all of them are normally used.
4. max_depth: can be ignored when the data or the feature count is small; with many samples and many features, try limiting it.
5. min_samples_split: if a node holds fewer samples than min_samples_split, no further split of that node is attempted. Ignore it for small datasets; for very large ones, increasing it is recommended.
6. min_samples_leaf: the minimum number of samples a leaf may hold; a leaf that falls below it is pruned away together with its sibling. Ignore it for small datasets; for large ones (say 100,000 samples) try a value such as 5.
7. min_weight_fraction_leaf: the minimum total sample weight a leaf must carry; below it, the leaf is pruned together with its sibling. The default is 0, i.e. weights are ignored. If many samples have missing values, or the class distribution is strongly skewed, sample weights come into play and this value starts to matter.
8. max_leaf_nodes: capping the number of leaf nodes guards against overfitting. The default "None" places no cap. With a cap set, the algorithm builds the best tree it can within that many leaves. Ignore it when there are few features; with many features, set a cap and tune the exact value by cross-validation.
9. class_weight: per-class sample weights, mainly to keep classes with many training samples from biasing the tree toward themselves. The weights can be given explicitly, or "balanced" lets the algorithm compute them so that rare classes receive higher weight.
10. min_impurity_split: limits further growth of the tree; if a node's impurity (Gini index, information gain, mean squared error, or mean absolute error) is below this threshold, the node is not split again and becomes a leaf.
n_estimators: the number of trees to build (a random-forest parameter).
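A minimal configuration sketch tying several of the parameters above together (the values are illustrative, not tuned):

dtr = tree.DecisionTreeRegressor(
    criterion = 'mse',        # impurity measure ('squared_error' in newer sklearn)
    splitter = 'best',        # search all candidate split points
    max_depth = 8,            # cap the depth of the tree
    min_samples_split = 10,   # do not split nodes smaller than this
    min_samples_leaf = 5,     # every leaf keeps at least 5 samples
    max_leaf_nodes = 64,      # overall budget on the number of leaves
    random_state = 42)
dtr.fit(data_train, target_train)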
# Note: sklearn.grid_search was removed in scikit-learn 0.20;
# a model_selection equivalent is sketched after the results below
from sklearn.grid_search import GridSearchCV
tree_param_grid = {'min_samples_split': [3, 6, 9], 'n_estimators': [10, 50, 100]}
# grid search: evaluate every parameter combination with cv-fold cross-validation
grid = GridSearchCV(RandomForestRegressor(), param_grid = tree_param_grid, cv = 5)
grid.fit(data_train, target_train)
grid.grid_scores_, grid.best_params_, grid.best_score_
([mean: 0.78405, std: 0.00505, params: {'min_samples_split': 3, 'n_estimators': 10},
mean: 0.80529, std: 0.00448, params: {'min_samples_split': 3, 'n_estimators': 50},
mean: 0.80673, std: 0.00433, params: {'min_samples_split': 3, 'n_estimators': 100},
mean: 0.79016, std: 0.00124, params: {'min_samples_split': 6, 'n_estimators': 10},
mean: 0.80496, std: 0.00491, params: {'min_samples_split': 6, 'n_estimators': 50},
mean: 0.80671, std: 0.00408, params: {'min_samples_split': 6, 'n_estimators': 100},
mean: 0.78747, std: 0.00341, params: {'min_samples_split': 9, 'n_estimators': 10},
mean: 0.80481, std: 0.00322, params: {'min_samples_split': 9, 'n_estimators': 50},
mean: 0.80603, std: 0.00437, params: {'min_samples_split': 9, 'n_estimators': 100}],
{'min_samples_split': 3, 'n_estimators': 100},
0.8067250881273065)
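In scikit-learn 0.20 and later the same search goes through sklearn.model_selection, and grid_scores_ is replaced by the cv_results_ dictionary; a sketch under those assumptions:

from sklearn.model_selection import GridSearchCV

grid = GridSearchCV(RandomForestRegressor(random_state = 42),
                    param_grid = tree_param_grid, cv = 5)
grid.fit(data_train, target_train)
# cv_results_ stores one entry per parameter combination, as parallel arrays
for params, mean, std in zip(grid.cv_results_['params'],
                             grid.cv_results_['mean_test_score'],
                             grid.cv_results_['std_test_score']):
    print(params, round(mean, 5), round(std, 5))
grid.best_params_, grid.best_score_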
# refit a forest with the best parameters found by the grid search
rfr = RandomForestRegressor(min_samples_split = 3, n_estimators = 100, random_state = 42)
rfr.fit(data_train, target_train)
rfr.score(data_test, target_test)
0.80908290496531576
pd.Series(rfr.feature_importances_, index = housing.feature_names).sort_values(ascending = False)
MedInc 0.524257
AveOccup 0.137947
Latitude 0.090622
Longitude 0.089414
HouseAge 0.053970
AveRooms 0.044443
Population 0.030263
AveBedrms 0.029084
dtype: float64
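Since %matplotlib inline is already active, the same importances can also be drawn as a quick bar chart; a sketch:

pd.Series(rfr.feature_importances_, index = housing.feature_names).sort_values().plot.barh()
plt.xlabel('feature importance')
plt.show()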