An example of solving a classification problem with sklearn's DecisionTreeClassifier.
Dataset description
The dataset is stored in a CSV file with 11 feature columns and 1 target column. The features include both numeric and string types.
Load the data
from sklearn import tree
from sklearn.model_selection import train_test_split
import pandas as pd

in_file = 'titanic_data.csv'
full_data = pd.read_csv(in_file)
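As a quick sanity check (optional, not part of the original post), the column count and types described above can be verified right after loading:

# Inspect the loaded data: expect 11 feature columns plus 'Survived'
print(full_data.shape)
print(full_data.dtypes)
print(full_data.head())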
Preprocess the data
1. Drop rows containing NaN values
full_data = full_data.dropna(axis=0)
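Note that dropna discards entire rows. A gentler alternative is to impute missing values instead; the sketch below assumes, for illustration, that 'Age' is the column containing the NaNs (true for the standard Titanic data, but worth verifying):

# Alternative to dropping rows: fill missing ages with the median
# (assumes 'Age' is the column with missing values)
full_data['Age'] = full_data['Age'].fillna(full_data['Age'].median())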
2. Split into feature and target variables
out = full_data['Survived']
features = full_data.drop('Survived', axis = 1)
3实幕、將特征變量中的字符串類型轉(zhuǎn)成數(shù)字類型
features = pandas.get_dummies(features)
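To see what get_dummies does, here is a toy frame with a string column like the dataset's 'Sex' column (the column name is just for illustration):

demo = pd.DataFrame({'Sex': ['male', 'female', 'male']})
print(pd.get_dummies(demo))
# Produces indicator columns Sex_female and Sex_male
# (0/1, or True/False in newer pandas versions)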
Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(features, out, test_size = 0.2, random_state = 0)
# Show the result of the split
print("Training set has {} samples.".format(X_train.shape[0]))
print("Testing set has {} samples.".format(X_test.shape[0]))
Define the evaluation metric
def accuracy_score(truth, pred):
    """ Returns accuracy score for input truth and predictions. """
    # Ensure that the number of predictions matches the number of outcomes
    if len(truth) == len(pred):
        # Calculate and return the accuracy as a percent,
        # taking the mean of the boolean comparison
        return (truth == pred).mean() * 100
    else:
        return 0
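A quick check of the metric with hand-made arrays (a minimal sketch, not part of the original post):

import numpy as np

y_true = np.array([1, 0, 1, 1])
y_pred = np.array([1, 0, 0, 1])
print(accuracy_score(y_true, y_pred))  # 3 of 4 correct -> 75.0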
Modeling
We build the model in two ways: one uses grid search with cross-validation to find the decision tree's optimal parameters and builds a tree with them; the other uses a decision tree with default parameters.
Create the decision tree, using grid search and cross-validation to find the optimal parameters and fit the data
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import KFold
from sklearn.metrics import make_scorer
from sklearn.tree import DecisionTreeClassifier
def fit_model_k_fold(X, y):
    """ Performs grid search over the 'max_depth' and 'criterion'
    parameters for a decision tree classifier trained on the input data [X, y]. """
    # Create cross-validation sets from the training data
    k_fold = KFold(n_splits=10)
    # Create a decision tree classifier object
    clf = DecisionTreeClassifier(random_state=80)
    params = {'max_depth': range(1, 21), 'criterion': ['entropy', 'gini']}
    # Transform 'accuracy_score' into a scoring function using 'make_scorer'
    scoring_fnc = make_scorer(accuracy_score)
    # Create the grid search object
    grid = GridSearchCV(clf, param_grid=params, scoring=scoring_fnc, cv=k_fold)
    # Fit the grid search object to the data to compute the optimal model
    grid = grid.fit(X, y)
    # Return the optimal model after fitting the data
    return grid.best_estimator_
View the optimal parameters
Once the grid search has run (clf = fit_model_k_fold(X_train, y_train), shown under "Predict" below), the chosen parameters can be inspected:
print("k_fold Parameter 'max_depth' is {} for the optimal model.".format(clf.get_params()['max_depth']))
print("k_fold Parameter 'criterion' is {} for the optimal model.".format(clf.get_params()['criterion']))
Create a decision tree with default parameters
def predict_4(X, Y):
    # Fit a decision tree with default parameters
    clf = tree.DecisionTreeClassifier()
    clf = clf.fit(X, Y)
    return clf
Predict
clf = fit_model_k_fold(X_train, y_train)
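With the tuned tree fitted, both models can be scored on the held-out test set using the accuracy metric defined above; a minimal sketch (variable names follow the code in this post):

# Evaluate the tuned tree on the held-out test set
print("Tuned tree accuracy: {:.2f}%".format(accuracy_score(y_test, clf.predict(X_test))))
# Fit and evaluate the default-parameter tree for comparison
default_clf = predict_4(X_train, y_train)
print("Default tree accuracy: {:.2f}%".format(accuracy_score(y_test, default_clf.predict(X_test))))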
Plot the decision tree
from IPython.display import Image
import pydotplus
dot_data = tree.export_graphviz(clf, out_file=None,
class_names=['0','1'],
filled=True, rounded=True,
special_characters=True)
graph = pydotplus.graph_from_dot_data(dot_data)
Image(graph.create_png())
(Image: the rendered decision tree)
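If Graphviz/pydotplus is not installed, newer scikit-learn releases (0.21+) can render the tree directly with matplotlib via tree.plot_tree; a minimal alternative sketch:

import matplotlib.pyplot as plt

plt.figure(figsize=(20, 10))
tree.plot_tree(clf, class_names=['0', '1'], filled=True, rounded=True)
plt.show()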
The content above comes from the 822 Lab's second knowledge-sharing session on May 7, 2017, at 17:30: predicting Titanic survivors.
Our 822, our youth.
Everyone who loves knowledge and loves life is welcome to grow together with the 822 Lab: eat, drink, have fun, and enjoy knowledge.