Original article; last updated: 2018-06-04
1. Confusion Matrix
Course source: python數(shù)據(jù)分析與機器學(xué)習(xí)實戰(zhàn)-唐宇迪 (Python Data Analysis and Machine Learning in Practice, Tang Yudi)
Course materials: the practice file creditcard.csv used in this lesson can be downloaded from the link below:
Link: https://pan.baidu.com/s/1APgU4cTAaM9zb8_xAIc41Q  Password: xgg7
This lesson explains what a confusion matrix is and what it can tell us about a model.
A confusion matrix is laid out as a grid with an x-axis and a y-axis; in binary classification each axis carries the two classes 0 and 1. The x-axis holds the predicted label and the y-axis the true label, so each cell counts how often a given true label received a given prediction. Comparing the true labels against the predictions this way lets us compute the metrics used to evaluate the current model.
Recall was defined earlier as recall = TP / (TP + FN); it reads off the matrix directly: the true positives TP sit in the (true 1, predicted 1) cell, and the false negatives FN in the (true 1, predicted 0) cell.
The matrix also lets us measure accuracy. In the example matrix, 129 samples have true label 0 and are predicted as 0, and 137 samples have true label 1 and are predicted as 1, so accuracy = (129 + 137) / (129 + 20 + 10 + 137) ≈ 0.90.
As this shows, several metrics can be read straight off a confusion matrix, for example accuracy and recall.
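To make this concrete, here is a minimal sketch that recomputes these metrics from the matrix values quoted above, using scikit-learn's layout (rows are true labels, columns are predicted labels):

import numpy as np

# Confusion matrix in scikit-learn's layout:
# [[TN, FP],
#  [FN, TP]]
cm = np.array([[129,  20],
               [ 10, 137]])

TN, FP = cm[0, 0], cm[0, 1]
FN, TP = cm[1, 0], cm[1, 1]

recall   = TP / (TP + FN)         # 137 / 147 ≈ 0.932
accuracy = (TP + TN) / cm.sum()   # 266 / 296 ≈ 0.899

print("recall   =", recall)
print("accuracy =", accuracy)

Note that this recall of ≈ 0.932 matches the script output further below.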
import itertools
import numpy as np
import matplotlib.pyplot as plt

def plot_confusion_matrix(cm, classes,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    """
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=0)
    plt.yticks(tick_marks, classes)

    # Write each count into its cell: white text on dark cells, black on light ones
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j],
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

# Train on the undersampled data, evaluate on the undersampled test set
# (newer scikit-learn versions also need solver='liblinear' for the L1 penalty)
lr = LogisticRegression(C=best_c, penalty='l1')
lr.fit(X_train_undersample, y_train_undersample.values.ravel())
y_pred_undersample = lr.predict(X_test_undersample.values)

# Compute confusion matrix
cnf_matrix = confusion_matrix(y_test_undersample, y_pred_undersample)
np.set_printoptions(precision=2)

# Recall = TP / (TP + FN), read from the second row of the matrix
print("Recall metric in the testing dataset: ",
      cnf_matrix[1,1] / (cnf_matrix[1,0] + cnf_matrix[1,1]))

# Plot non-normalized confusion matrix
class_names = [0, 1]
plt.figure()
plot_confusion_matrix(cnf_matrix,
                      classes=class_names,
                      title='Confusion matrix')
plt.show()
Output:
Recall metric in the testing dataset: 0.931972789116
Under what conditions was this confusion matrix computed? It was computed on the undersampled dataset, whose test split holds only two hundred or so samples (296 in the matrix above: 129 + 20 + 10 + 137). This is a small-scale sanity check; the model has not yet been tested on large-scale data.
As mentioned before, a model should be evaluated on the original data, so we must test not only on the undersampled set but also on the original dataset, whose test portion holds roughly 80,000+ records. The next step is therefore to repeat the evaluation there.
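For context, the undersampled train/test split used above was built in an earlier lesson, roughly as sketched below; the names X_undersample and y_undersample and the 70/30 split are assumptions based on the course code:

from sklearn.model_selection import train_test_split

# Assumed: X_undersample / y_undersample hold the balanced dataset obtained by
# undersampling the majority (non-fraud) class down to the size of the fraud class
X_train_undersample, X_test_undersample, y_train_undersample, y_test_undersample = \
    train_test_split(X_undersample, y_undersample, test_size=0.3, random_state=0)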
First we compute recall on the original test data:
recall = TP / (TP + FN) = 135 / (135 + 12) ≈ 91.8%, which is quite high.
With recall this high, is anything wrong with these results? The diagonal entries, 76715 and 135, are the correctly predicted samples. The 8581 are samples where no anomaly actually occurred but which were flagged anyway: to catch the 135 fraud samples, the model also pulled in an extra 8581 innocent samples and declared them anomalous. That clearly does not hurt recall, but it drags precision down to roughly 135 / (135 + 8581) ≈ 1.5%, and from a practical standpoint it greatly increases the workload: we found the 135 fraudulent samples, but 8581 innocent ones along with them.
Every flagged sample still has to be examined to confirm it really is anomalous, so what do we do about 8581 innocent ones? With undersampling, recall reaches the target, but the number of false alarms is far beyond what we can tolerate. How can this be solved? If undersampling runs into this problem, might oversampling work better?
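As a quick check of those figures, here is a sketch that recomputes recall and precision from the matrix entries quoted above:

import numpy as np

# Full-test-set confusion matrix, from the numbers quoted above
# (rows = true label, cols = predicted label)
cm_full = np.array([[76715, 8581],
                    [   12,  135]])

recall    = cm_full[1, 1] / (cm_full[1, 0] + cm_full[1, 1])  # 135 / 147  ≈ 0.918
precision = cm_full[1, 1] / (cm_full[0, 1] + cm_full[1, 1])  # 135 / 8716 ≈ 0.015

print("recall    =", round(recall, 3))     # high: nearly all frauds are caught
print("precision =", round(precision, 3))  # tiny: 8581 false alarms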
The code below shows how this full-test-set evaluation was produced: the model is still trained on the undersampled data, but its predictions are now scored against the entire original test set.
# Same undersample-trained model, now evaluated on the full original test set
lr = LogisticRegression(C=best_c, penalty='l1')
lr.fit(X_train_undersample, y_train_undersample.values.ravel())
y_pred = lr.predict(X_test.values)

# Compute confusion matrix
cnf_matrix = confusion_matrix(y_test, y_pred)
np.set_printoptions(precision=2)

print("Recall metric in the testing dataset: ",
      cnf_matrix[1,1] / (cnf_matrix[1,0] + cnf_matrix[1,1]))

# Plot non-normalized confusion matrix
class_names = [0, 1]
plt.figure()
plot_confusion_matrix(cnf_matrix,
                      classes=class_names,
                      title='Confusion matrix')
plt.show()
Output:
Recall metric in the testing dataset: 0.918367346939
This ≈ 91.8% recall matches the figure discussed above. So far every model has been built on the undersampled dataset, which is what gave us these recall values. But what if we do nothing about the 0/1 imbalance, with no oversampling or undersampling, and build the model directly on the raw data? How badly does that work? Below we run the same cross-validation directly on the original training set. As the output shows, with imbalanced data and no resampling the model is clearly worse than the undersampled one: the best mean recall is only about 62%.
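As a reminder, printing_Kfold_scores was defined in an earlier lesson of the course; the sketch below shows roughly what it does, assuming pandas inputs (the solver='liblinear' argument is an addition needed by newer scikit-learn versions):

import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

def printing_Kfold_scores(x_train_data, y_train_data):
    """Sketch: 5-fold cross-validation over several C values, scored by recall."""
    fold = KFold(n_splits=5, shuffle=False)
    c_param_range = [0.01, 0.1, 1, 10, 100]
    best_c, best_mean = None, -1.0
    for c_param in c_param_range:
        print('-------------------------------------------')
        print('C parameter:', c_param)
        print('-------------------------------------------')
        recall_accs = []
        for iteration, (train_idx, test_idx) in enumerate(fold.split(x_train_data), 1):
            # Fit an L1-regularized logistic regression on this fold's training part
            lr = LogisticRegression(C=c_param, penalty='l1', solver='liblinear')
            lr.fit(x_train_data.iloc[train_idx], y_train_data.iloc[train_idx].values.ravel())
            y_pred = lr.predict(x_train_data.iloc[test_idx].values)
            recall_acc = recall_score(y_train_data.iloc[test_idx].values.ravel(), y_pred)
            recall_accs.append(recall_acc)
            print('Iteration', iteration, ': recall score =', recall_acc)
        mean_recall = np.mean(recall_accs)
        print('Mean recall score', mean_recall)
        # Keep the C value with the best mean recall across folds
        if mean_recall > best_mean:
            best_mean, best_c = mean_recall, c_param
    print('Best model to choose from cross validation is with C parameter =', best_c)
    return best_c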
best_c = printing_Kfold_scores(X_train,y_train)
Output:
-------------------------------------------
C parameter: 0.01
-------------------------------------------
Iteration 1 : recall score = 0.492537313433
Iteration 2 : recall score = 0.602739726027
Iteration 3 : recall score = 0.683333333333
Iteration 4 : recall score = 0.569230769231
Iteration 5 : recall score = 0.45
Mean recall score 0.559568228405
-------------------------------------------
C parameter: 0.1
-------------------------------------------
Iteration 1 : recall score = 0.567164179104
Iteration 2 : recall score = 0.616438356164
Iteration 3 : recall score = 0.683333333333
Iteration 4 : recall score = 0.584615384615
Iteration 5 : recall score = 0.525
Mean recall score 0.595310250644
-------------------------------------------
C parameter: 1
-------------------------------------------
Iteration 1 : recall score = 0.55223880597
Iteration 2 : recall score = 0.616438356164
Iteration 3 : recall score = 0.716666666667
Iteration 4 : recall score = 0.615384615385
Iteration 5 : recall score = 0.5625
Mean recall score 0.612645688837
-------------------------------------------
C parameter: 10
-------------------------------------------
Iteration 1 : recall score = 0.55223880597
Iteration 2 : recall score = 0.616438356164
Iteration 3 : recall score = 0.733333333333
Iteration 4 : recall score = 0.615384615385
Iteration 5 : recall score = 0.575
Mean recall score 0.61847902217
-------------------------------------------
C parameter: 100
-------------------------------------------
Iteration 1 : recall score = 0.55223880597
Iteration 2 : recall score = 0.616438356164
Iteration 3 : recall score = 0.733333333333
Iteration 4 : recall score = 0.615384615385
Iteration 5 : recall score = 0.575
Mean recall score 0.61847902217
*********************************************************************************
Best model to choose from cross validation is with C parameter = 10.0
*********************************************************************************