Preface
This post puts sklearn's commonly used ensemble algorithms into practice by classifying Dangdang bestseller reviews.
For the concepts behind ensemble methods, see the earlier article summarizing Bootstrapping, Bagging, and Boosting.
It also helps to first read the earlier post on Naive Bayes classification practice: this post analyzes the same Dangdang review data, and some of the code details are already explained in detail there.
- Full code: https://github.com/xhades/rates_classify/tree/master/rates_classify
- Training data download: https://pan.baidu.com/s/1kVOS39l
Main Text
RandomForest
sklearn RandomForestClassifier documentation
Code
import numpy as np
from numpy import array, argmax, reshape
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
import pickle
from sklearn.ensemble import RandomForestClassifier as RDF
np.set_printoptions(threshold=np.inf)
# 70/30 train/test split
def train(xFile, yFile):
    with open(xFile, "rb") as file_r:
        X = pickle.load(file_r)
    X = reshape(X, (212841, -1))  # reshape to (212841, 30*128)
    # Read the label data and encode it
    with open(yFile, "r") as yFile_r:
        labelLines = [_.strip("\n") for _ in yFile_r.readlines()]
    values = array(labelLines)
    labelEncoder = LabelEncoder()
    integerEncoded = labelEncoder.fit_transform(values)
    integerEncoded = integerEncoded.reshape(len(integerEncoded), 1)
    # print(integerEncoded)
    # Get the encoded labels
    Y = integerEncoded.reshape(212841, )
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=42)
    # Random forest classifier
    # criterion: "gini" (Gini impurity, the CART default) or "entropy" (information gain)
    clf = RDF(criterion="gini")
    clf.fit(X_train, Y_train)
    # Evaluate on the test data
    predict = clf.predict(X_test)
    count = 0
    for p, t in zip(predict, Y_test):
        if p == t:
            count += 1
    print("RandomForest Accuracy is:", count / len(Y_test))

if __name__ == "__main__":
    xFile = "Res/char_embedded.pkl"
    yFile = "data/label.txt"
    print("Start Training.....")
    train(xFile, yFile)
    print("End.....")
Key parameter notes
- criterion can be "gini" or "entropy": the former uses Gini impurity, the latter information gain. The default "gini" (i.e., the CART approach) is usually fine, unless you prefer ID3/C4.5-style feature selection based on information gain.
- The other parameters are left at their defaults for now; a tuning sketch follows below.
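If you do want to go beyond the defaults, here is a minimal tuning sketch using sklearn's GridSearchCV. The grid values are illustrative assumptions, not the settings behind the result that follows:

from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

# Illustrative grid; the candidate values are assumptions, not tuned choices
param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [None, 10, 20],
}
search = GridSearchCV(RandomForestClassifier(criterion="gini"),
                      param_grid, cv=3, n_jobs=-1)
search.fit(X_train, Y_train)
print(search.best_params_, search.best_score_)

With 212,841 samples a small grid and 3-fold cross-validation keep the run time manageable.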
Results
Start Training.....
RandomForest Accuracy is: 0.9258453009255634
End.....
The final accuracy comes out to roughly 92.6%.
Gradient Boosting (GradientBoostingClassifier)
sklearn GradientBoostingClassifier documentation
Boosting builds a strong learner by iterating weak learners sequentially, each round building on the previous one. This differs from Bagging's parallel approach, which is why gradient boosting takes much longer to train here.
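Because the weak learners are fitted one after another, you can actually watch the ensemble improve stage by stage with staged_predict. A minimal sketch, assuming the fitted clf and the X_test/Y_test split from the code below:

from sklearn.metrics import accuracy_score

# staged_predict yields the ensemble's predictions after each boosting
# iteration, which makes the sequential nature of boosting visible
for i, y_pred in enumerate(clf.staged_predict(X_test), start=1):
    if i % 20 == 0:
        print("after %d trees: accuracy = %.4f" % (i, accuracy_score(Y_test, y_pred)))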
Code
import numpy as np
from numpy import array, argmax, reshape
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
import pickle
from sklearn.ensemble import GradientBoostingClassifier as GBC
np.set_printoptions(threshold=np.inf)
# 70/30 train/test split
def train(xFile, yFile):
    with open(xFile, "rb") as file_r:
        X = pickle.load(file_r)
    X = reshape(X, (212841, -1))  # reshape to (212841, 30*128)
    # Read the label data and encode it
    with open(yFile, "r") as yFile_r:
        labelLines = [_.strip("\n") for _ in yFile_r.readlines()]
    values = array(labelLines)
    labelEncoder = LabelEncoder()
    integerEncoded = labelEncoder.fit_transform(values)
    integerEncoded = integerEncoded.reshape(len(integerEncoded), 1)
    # print(integerEncoded)
    # Get the encoded labels
    Y = integerEncoded.reshape(212841, )
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=42)
    # Gradient boosting classifier
    clf = GBC(loss="deviance", subsample=0.8, criterion="friedman_mse")
    clf.fit(X_train, Y_train)
    # Evaluate on the test data
    predict = clf.predict(X_test)
    count = 0
    for p, t in zip(predict, Y_test):
        if p == t:
            count += 1
    print("GradientBoosting Accuracy is:", count / len(Y_test))

if __name__ == "__main__":
    xFile = "Res/char_embedded.pkl"
    yFile = "data/label.txt"
    print("Start Training.....")
    train(xFile, yFile)
    print("End.....")
Key parameter notes
- subsample: each tree is trained on a random subsample of the data. Set it below 1; the best value has to be found during tuning (see the sketch after the default-parameter listing below).
- The other parameters will be written up some other day (too lazy for now).
- The default parameter settings in the sklearn source:
_SUPPORTED_LOSS = ('deviance', 'exponential')

def __init__(self, loss='deviance', learning_rate=0.1, n_estimators=100,
             subsample=1.0, criterion='friedman_mse', min_samples_split=2,
             min_samples_leaf=1, min_weight_fraction_leaf=0.,
             max_depth=3, min_impurity_split=1e-7, init=None,
             random_state=None, max_features=None, verbose=0,
             max_leaf_nodes=None, warm_start=False,
             presort='auto'):
    super(GradientBoostingClassifier, self).__init__(
        loss=loss, learning_rate=learning_rate, n_estimators=n_estimators,
        criterion=criterion, min_samples_split=min_samples_split,
        min_samples_leaf=min_samples_leaf,
        min_weight_fraction_leaf=min_weight_fraction_leaf,
        max_depth=max_depth, init=init, subsample=subsample,
        max_features=max_features,
        random_state=random_state, verbose=verbose,
        max_leaf_nodes=max_leaf_nodes,
        min_impurity_split=min_impurity_split,
        warm_start=warm_start,
        presort=presort)
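To get a feel for subsample in practice, here is a small sketch that compares a few candidate values on the held-out split. The values 0.6/0.8/1.0 are illustrative assumptions; with subsample < 1.0 this becomes stochastic gradient boosting:

from sklearn.ensemble import GradientBoostingClassifier

# Train one model per candidate subsample fraction and compare
# held-out accuracy; expect this to be slow on the full dataset
for frac in (0.6, 0.8, 1.0):
    model = GradientBoostingClassifier(subsample=frac)
    model.fit(X_train, Y_train)
    print("subsample=%.1f -> test accuracy %.4f" % (frac, model.score(X_test, Y_test)))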
Results
Start Training.....
GradientBoosting Accuracy is: 0.8833727467777551
End.....
The final accuracy is roughly 88.3%.