Theory
Ensemble Models
An ensemble classifier combines the outputs of several trained machine learning models to make a classification decision. Two common arrangements:
- Voting: several models are trained in parallel, and each model's output casts a vote toward the final decision
- Sequential: models are built one after another, each depending on the previous ones, and finally combined into a single model
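A minimal sketch of the voting arrangement, assuming scikit-learn and its `VotingClassifier` (which is not the class used later in this post): three different models are trained in parallel and their class predictions are combined by majority vote.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# synthetic data, just to make the sketch runnable
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

voter = VotingClassifier(
    estimators=[
        ('lr', LogisticRegression(max_iter=1000)),
        ('nb', GaussianNB()),
        ('dt', DecisionTreeClassifier(random_state=0)),
    ],
    voting='hard',  # each model casts one vote; the majority wins
)
voter.fit(X, y)
print(voter.score(X, y))
```

With `voting='soft'` the models would instead average their predicted probabilities, which often works better when the base models are well calibrated.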
Random Forest Classifier
The random forest classifier is a voting-style ensemble. Its core idea is to train a number of decision trees in parallel and take a majority vote over their outputs. To keep all the trees from growing into the same shape, each tree is trained on a bootstrap sample of the data, and the split feature is chosen from a random subset of features rather than purely by maximum information gain.
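The idea can be hand-rolled in a few lines (a sketch only, not the scikit-learn implementation used below): grow each tree on a bootstrap sample, restrict it to a random feature subset at every split via `max_features='sqrt'`, then take a majority vote.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=8, random_state=0)
rng = np.random.RandomState(0)

trees = []
for _ in range(25):
    idx = rng.randint(0, len(X), size=len(X))  # bootstrap sample (with replacement)
    tree = DecisionTreeClassifier(max_features='sqrt', random_state=rng)
    tree.fit(X[idx], y[idx])
    trees.append(tree)

# majority vote over the 25 trees (binary labels, so mean > 0.5 is the majority)
votes = np.stack([t.predict(X) for t in trees])
forest_pred = (votes.mean(axis=0) > 0.5).astype(int)
print((forest_pred == y).mean())
```

Both sources of randomness matter: with identical data and greedy feature selection, every tree would be the same and voting would add nothing.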
Gradient Boosted Decision Trees
Gradient boosted decision trees are less often written up for classification (almost all material one can find covers regression trees). The basic idea: each new tree is trained on the residuals the previous trees leave behind, i.e. on pairs of (previous training data, residual), and the final prediction is a weighted combination of every tree's output. For classification the "residual" is a pseudo-residual, the negative gradient of the loss; under log loss this works out to the true label minus the predicted class probability.
Code
Import the dataset: Titanic passenger data
import pandas as pd
titan = pd.read_csv("http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic.txt")
print(titan.head())
row.names pclass survived \
0 1 1st 1
1 2 1st 0
2 3 1st 0
3 4 1st 0
4 5 1st 1
name age embarked \
0 Allen, Miss Elisabeth Walton 29.0000 Southampton
1 Allison, Miss Helen Loraine 2.0000 Southampton
2 Allison, Mr Hudson Joshua Creighton 30.0000 Southampton
3 Allison, Mrs Hudson J.C. (Bessie Waldo Daniels) 25.0000 Southampton
4 Allison, Master Hudson Trevor 0.9167 Southampton
home.dest room ticket boat sex
0 St Louis, MO B-5 24160 L221 2 female
1 Montreal, PQ / Chesterville, ON C26 NaN NaN female
2 Montreal, PQ / Chesterville, ON C26 NaN (135) male
3 Montreal, PQ / Chesterville, ON C26 NaN NaN female
4 Montreal, PQ / Chesterville, ON C22 NaN 11 male
Data Preprocessing
Select features
x = titan[['pclass','age',"sex"]]
y = titan['survived']
print(x.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1313 entries, 0 to 1312
Data columns (total 3 columns):
pclass 1313 non-null object
age 633 non-null float64
sex 1313 non-null object
dtypes: float64(1), object(2)
memory usage: 30.9+ KB
None
Handle missing values
x = x.copy()                                 # work on a copy to avoid SettingWithCopyWarning
x['age'] = x['age'].fillna(x['age'].mean())  # only 'age' has missing values
print(x.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1313 entries, 0 to 1312
Data columns (total 3 columns):
pclass 1313 non-null object
age 1313 non-null float64
sex 1313 non-null object
dtypes: float64(1), object(2)
memory usage: 30.9+ KB
None
Split the dataset
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.25,random_state=1)
print(x_train.shape,x_test.shape)
(984, 3) (329, 3)
Vectorize features
from sklearn.feature_extraction import DictVectorizer
vec = DictVectorizer(sparse=False)
x_train = vec.fit_transform(x_train.to_dict(orient='records'))
x_test = vec.transform(x_test.to_dict(orient='records'))
print(vec.feature_names_)
['age', 'pclass=1st', 'pclass=2nd', 'pclass=3rd', 'sex=female', 'sex=male']
Model Training
Random Forest
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier()
rfc.fit(x_train,y_train)
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
max_depth=None, max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
oob_score=False, random_state=None, verbose=0,
warm_start=False)
Gradient Boosted Decision Trees
from sklearn.ensemble import GradientBoostingClassifier
gbc = GradientBoostingClassifier()
gbc.fit(x_train,y_train)
GradientBoostingClassifier(criterion='friedman_mse', init=None,
learning_rate=0.1, loss='deviance', max_depth=3,
max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=100,
presort='auto', random_state=None, subsample=1.0, verbose=0,
warm_start=False)
Model Evaluation
Random Forest
rfc.score(x_test,y_test)
0.83282674772036469
from sklearn.metrics import classification_report
rfc_pre = rfc.predict(x_test)
print(classification_report(y_test, rfc_pre))  # y_true comes first
precision recall f1-score support
0 0.89 0.84 0.87 211
1 0.74 0.82 0.78 118
avg / total 0.84 0.83 0.83 329
Gradient Boosted Decision Trees
gbc.score(x_test,y_test)
0.82370820668693012
from sklearn.metrics import classification_report
print(classification_report(y_test, gbc.predict(x_test)))  # y_true comes first
precision recall f1-score support
0 0.92 0.81 0.86 224
1 0.68 0.85 0.75 105
avg / total 0.84 0.82 0.83 329