Machine Learning Project in Practice: Credit Card Fraud Detection

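Workflow:

First, inspect the data and check whether the classes are balanced; if they are not, the imbalance has to be handled. (This dataset is fairly clean, so no further preprocessing is needed and it can be used as-is. In many cases, ready-made feature data is not available.)
Standardize the data so that the values vary over a smaller range, then select the data to train on.
Introduce the confusion matrix and the model evaluation metrics, then choose parameters through cross-validation.
Compare the predicted probability with a threshold to get the final prediction. Different thresholds can change the results considerably.
The SMOTE algorithm.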

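1. Task basics

The dataset contains credit card transactions made by European cardholders in September 2013. It covers transactions that occurred over two days, with 492 frauds among 284807 transactions. The dataset is highly imbalanced: the positive class (fraud) accounts for 0.172% of all transactions. Because of confidentiality constraints, the original features and further background information about the data are not provided. Features V1, V2, ... V28 are the principal components obtained with PCA; the only features that were not transformed with PCA are 'Time' and 'Amount'. 'Time' contains the number of seconds elapsed between each transaction and the first transaction in the dataset. 'Class' is the response variable: it is 1 in case of fraud and 0 otherwise.

Goal: classify the normal and fraudulent transactions in the dataset, and make predictions on the test data.

First, import the required libraries: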

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

Read the dataset file and look at the first 5 rows:

data = pd.read_csv("creditcard.csv")
data.head()

image1.png

image2.png

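In the figure above, the Class label gives the category of each row: 0 is a normal transaction and 1 is a fraudulent one.

The task here is fraud detection on credit card data. The dataset contains both normal and problematic transactions, and in general the problematic ones make up only a tiny fraction.

The bar chart below makes the difference between the number of normal and fraudulent transactions easy to see.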

count_classes = data['Class'].value_counts().sort_index()
count_classes.plot(kind='bar')  # pandas can draw simple plots directly
# Bar chart of the class counts
plt.title("Fraud class histogram")
plt.xlabel("Class")
# Frequency
plt.ylabel("Frequency")

image3.png

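The output shows roughly 280,000 normal samples (class 0). The fraudulent samples (class 1) are so few that the bar is hard to see in the chart, but they do exist: there are only a few hundred of them.

The values in the Amount column vary over a wide range. Machine learning models generally work better when the features are on comparable scales, so Amount needs to be preprocessed by standardizing it.

The Time column is of little use here, and Amount is replaced by its standardized version, so both columns are dropped.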
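Standardization here means rescaling a column to zero mean and unit variance, z = (x - mean) / std. The snippet below is a minimal standard-library sketch of that formula; the `standardize` helper and the sample amounts are made up for illustration and are not part of the original notebook.

```python
import statistics


def standardize(values):
    """Return z-scores: (x - mean) / population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


# Hypothetical transaction amounts, chosen only for illustration.
amounts = [1.0, 5.0, 20.0, 150.0, 2500.0]
scaled = standardize(amounts)
print([round(s, 3) for s in scaled])
```

The population standard deviation is used because that is what `StandardScaler` computes; after scaling, the very large amount no longer dwarfs the others by three orders of magnitude.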

# Preprocessing: standardize the data
from sklearn.preprocessing import StandardScaler
# reshape(-1, 1): -1 lets NumPy infer the number of rows; .values is needed to get an ndarray
# Add the standardized amount as a new feature column
data['normAmount'] = StandardScaler().fit_transform(data['Amount'].values.reshape(-1, 1))
data = data.drop(['Time', 'Amount'], axis=1)
data.head()

image4.png

image5.png

2. Handling imbalanced class distribution

As noted above, the numbers of normal and fraudulent samples in the dataset differ enormously. There are two common strategies for this kind of class imbalance:

(1) Undersampling: the counts above show about 280,000 samples of class 0 and only a few hundred of class 1. Reduce class 0 to a few hundred samples as well, so that both classes are equally small.
(2) Oversampling: with the same counts, class 0 is large and class 1 is small. Generate additional class 1 samples until there are as many of them as class 0 samples.
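The two strategies can be sketched on plain lists of row ids. This is only an illustration under simple assumptions: the helper names and the id ranges are hypothetical, and the oversampling shown here just repeats minority rows at random, which is a simpler stand-in for the synthetic generation that SMOTE performs later in this article.

```python
import random


def undersample(majority, minority, rng):
    """Keep every minority item plus an equally sized random subset of the majority."""
    return rng.sample(majority, len(minority)) + list(minority)


def oversample(majority, minority, rng):
    """Keep everything and redraw minority items (with replacement) until the classes match."""
    extra = rng.choices(minority, k=len(majority) - len(minority))
    return list(majority) + list(minority) + extra


rng = random.Random(0)
normal_ids = list(range(1000))        # hypothetical majority-class row ids
fraud_ids = list(range(1000, 1010))   # hypothetical minority-class row ids

under = undersample(normal_ids, fraud_ids, rng)
over = oversample(normal_ids, fraud_ids, rng)
print(len(under), len(over))
```

Undersampling throws most of the majority class away, while oversampling keeps all of it; that trade-off is what the rest of the article compares.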

We start with the undersampling strategy:

# loc indexes by label, iloc indexes by integer position
# ix accepted both labels and positions, but it has been removed from pandas
 
# X = data.ix[:, data.columns != 'Class']
# # print(X)
# y = data.ix[:, data.columns == 'Class']
 
X = data.iloc[:, data.columns != 'Class']  # feature data
# print(X)
y = data.iloc[:, data.columns == 'Class']  # labels
 
# Number of data points in the minority class (fraud)
number_records_fraud = len(data[data.Class == 1])
fraud_indices = np.array(data[data.Class == 1].index)
 
# Picking the indices of the normal class
normal_indices = data[data.Class == 0].index
 
# Out of the indices we picked, randomly select "x" number (number_records_fraud)
# replace=False samples without replacement, so no normal row is picked twice
random_normal_indices = np.random.choice(normal_indices,
                                         number_records_fraud,
                                         replace=False)
random_normal_indices = np.array(random_normal_indices)
 
# Appending the 2 indices
under_sample_indices = np.concatenate([fraud_indices, random_normal_indices])
 
# Under sample dataset
under_sample_data = data.iloc[under_sample_indices, :]
 
X_undersample = under_sample_data.iloc[:, under_sample_data.columns != 'Class']
y_undersample = under_sample_data.iloc[:, under_sample_data.columns == 'Class']
 
# Showing the class ratio in the undersampled data
print(
    "Percentage of normal transactions:",
    len(under_sample_data[under_sample_data.Class == 0]) /
    len(under_sample_data))
print(
    "Percentage of fraud transactions:",
    len(under_sample_data[under_sample_data.Class == 1]) /
    len(under_sample_data))
print("Total number of transactions in resampled data:",
      len(under_sample_data))

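After undersampling, normal and fraudulent transactions each make up 50% of the data, and the total number of samples is only a small fraction of the original.

Next, split both the original dataset and the undersampled dataset into training and test sets.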

# With newer scikit-learn versions, the old import fails:
# from sklearn.cross_validation import train_test_split
# ModuleNotFoundError: No module named 'sklearn.cross_validation'
# The module was removed; import from sklearn.model_selection instead
from sklearn.model_selection import train_test_split
 
# Whole dataset; test_size is the fraction of the data held out as the test set
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.3,
                                                    random_state=0)
 
print("Number transactions train dataset:", len(X_train))
print("Number transactions test dataset:", len(X_test))
print("Total number of transactions:", len(X_train) + len(X_test))
 
# Undersampled dataset
X_train_undersample, X_test_undersample, y_train_undersample, y_test_undersample = train_test_split(
    X_undersample, y_undersample, test_size=0.3, random_state=0)
 
print("")
print("Number transactions train dataset:", len(X_train_undersample))
print("Number transactions test dataset:", len(X_test_undersample))
print("Total number of transactions:", len(X_train_undersample) + len(X_test_undersample))

# Output:
Number transactions train dataset: 199364
Number transactions test dataset: 85443
Total number of transactions: 284807

Number transactions train dataset: 688
Number transactions test dataset: 296
Total number of transactions: 984

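3. Model evaluation methods

Suppose we have data on 1000 patients, 990 of whom do not have cancer and 10 of whom do. Take the most common evaluation metric, accuracy, which measures how often the predictions agree with the ground truth. Denote the true values by y and the predicted values by y1. Say there are 10 samples, numbered 1, 2, 3, ..., 10, each with a true value and a predicted value that is either 0 or 1. For each sample, check whether y and y1 agree: if they do, mark the sample with "="; if they do not, mark it with "≠". If 8 of the 10 samples are marked "=", the accuracy is 8/10 = 80%.

Now take the 990 healthy patients and the 10 cancer patients and build a model that predicts every patient as healthy. Feed all 1000 samples into it: what is the accuracy? 990/1000 = 99%. The model assigns every sample to the majority class and never flags a single cancer case. A hospital wants to identify cancer, yet this model detects 0 cases. Despite its 99% accuracy, the model is meaningless, because it cannot find even one cancer patient. When building a model, keep one thing in mind: building it is easy; the hard part is deciding how to evaluate it.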
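The 99% figure is easy to reproduce. The snippet below is a small sketch of the scenario just described, with labels encoded as 1 for cancer and 0 for healthy (an encoding chosen here for illustration).

```python
y_true = [1] * 10 + [0] * 990   # 10 cancer patients, 990 healthy patients
y_pred = [0] * 1000             # a "model" that always predicts healthy

correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)
found = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)

print("accuracy:", accuracy)
print("cancer patients found:", found)
```

Accuracy comes out at 0.99 while the number of detected cancer patients is 0, which is exactly why accuracy alone is misleading on imbalanced data.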

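Accuracy was just used to evaluate the model, but accuracy can be deceptive, especially when the classes are imbalanced. This brings us to recall, also known as the true positive rate or sensitivity. Recall takes values between 0 and 1. Our goal is to find the 10 people who have cancer, so the metric should follow from that goal: out of 10 cancer patients, how many does the model detect? If it detects 0, recall is 0/10 = 0. If it detects 2, recall is 2/10 = 20%. Recall is a more sensible way to judge this model. Building a model comes down to choosing parameters, and expressing recall precisely takes a little more vocabulary. Four terms come up frequently in statistics: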

image6.png
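Assuming the four terms in the figure are the usual true positives (TP), false positives (FP), true negatives (TN) and false negatives (FN), they can be counted directly from the labels. The sketch below uses hypothetical helper names and reuses the example of a model that finds 2 of the 10 cancer patients.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN, FN for binary labels where 1 is the positive class."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 0 and pred == 1:
            fp += 1
        elif truth == 0 and pred == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def recall(tp, fn):
    """Recall = TP / (TP + FN); defined as 0.0 when there are no positives."""
    return tp / (tp + fn) if (tp + fn) else 0.0


y_true = [1] * 10 + [0] * 990
y_pred = [1] * 2 + [0] * 998    # only 2 of the 10 cancer patients are detected

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
print(tp, fp, tn, fn, recall(tp, fn))
```

Here TP = 2 and FN = 8, so recall is 2/10, matching the 20% worked out in the text.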

# Recall = TP/(TP+FN)
from sklearn.linear_model import LogisticRegression  # logistic regression model
# from sklearn.cross_validation import KFold, cross_val_score  # no longer available in newer versions
from sklearn.model_selection import KFold, cross_val_score  # KFold splits the data into k folds for cross-validation
from sklearn.metrics import confusion_matrix, recall_score, classification_report

4. Regularization penalty

Suppose model A has weights θ1, θ2, θ3, ..., θ10 and model B also has weights θ1, θ2, θ3, ..., θ10, and both models reach a recall of 90%. If the recall is the same, does it matter which one we pick?
Suppose the weights of model A vary widely, as in the screenshot:

image7.png

The weights of model B vary much less, as in the screenshot:

image8.png

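Although both models reach 90% recall, the weights of model A swing too widely. We want a model that is more stable, one that fits not only the training data but also the test data as well as possible. We therefore prefer weights that vary less, because that lowers the risk of overfitting.

Overfitting means that a model performs well on the training set but poorly on the test set. It is a very common phenomenon, and to a large extent it is caused by weights that vary too widely. So we would like to end up with model B, whose weights vary less. How do we get there? This is where regularization comes in: it penalizes the model weights θ. Since the weights are sometimes spread widely and sometimes narrowly, we want to penalize model A heavily and model B lightly. Regularization quantifies this preference for a simpler description by changing the loss function to: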

image9.png

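C0 is the loss function before the regularization penalty is introduced, C is the new loss function with the penalty, and w stands for the weights. The formula above is L1 regularization. For model A the values of w vary widely, so the sum of |w| is larger and the resulting loss is larger as well. The parameter λ controls how strongly the weights are penalized. There is also a second variant, L2 regularization: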

image10.png
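To make the difference between models A and B concrete, the sketch below computes both penalty terms for two made-up weight vectors. It assumes the plain forms λ·Σ|w| for L1 and λ·Σw² for L2; the exact scaling constants in the figures may differ, and the weights and helper names are hypothetical.

```python
def l1_penalty(weights, lam):
    """L1 penalty term: lam * sum(|w|)."""
    return lam * sum(abs(w) for w in weights)


def l2_penalty(weights, lam):
    """L2 penalty term: lam * sum(w^2)."""
    return lam * sum(w * w for w in weights)


# Hypothetical weights: model A swings widely, model B stays close to zero.
weights_a = [12.0, -9.0, 7.5, -11.0]
weights_b = [0.5, -0.3, 0.2, -0.4]
lam = 0.1

for name, weights in (("A", weights_a), ("B", weights_b)):
    print(name, round(l1_penalty(weights, lam), 3), round(l2_penalty(weights, lam), 3))
```

With the same λ, model A pays a far larger penalty than model B, so if the two fit the data equally well the regularized loss favors B.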

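The main question is then how strong the penalty should be. It could be set to 0.1, which is a fairly weak penalty, or to 1, or to 10. Which value works best? There is no way to know in advance; cross-validation is needed to find out which setting gives the best result. The candidates are C_param_range = [0.01,0.1,1,10,100]. Note that in scikit-learn the C argument is the inverse of the regularization strength (it plays the role of 1/λ), so a smaller C means a stronger penalty. Each of these 5 values has to be tried in turn.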

5. Cross-validation

Suppose we have a dataset called data. When building a machine learning model, the data is usually split first: for example, 80% is used as the training set and 20% as the test set. The 80% is used to build the model and the remaining 20% to test it. So the first step is to split the data into a training set and a test set, and this step is mandatory. The second step is to split the training set evenly again, for example into 3 parts: subsets 1, 2 and 3.

Whatever kind of model is built, it comes with many parameters to choose. Is a larger value better, or a smaller one? Rules of thumb cannot answer this reliably. The way to settle on a parameter is cross-validation.

So what is cross-validation?

Round 1: train the model on subsets 1 and 2 together, then use subset 3 to check how well the model with the resulting weights performs. Subset 3 is the validation set; the validation set is part of the training set and is used to judge whether the model is good or bad.
Round 2: train the model on subsets 1 and 3, then validate it on subset 2.
Round 3: train the model on subsets 2 and 3, then validate it on subset 1.

Validating only once is risky. If only round 1 is run, subset 3 may happen to be an easy validation set, which makes the model look better than it is. The data may also contain errors and outliers; if those end up in the validation set, the model looks worse than it is. We want an estimate that is neither too high nor too low, so the validation is repeated several times and the results are averaged. Here subsets 1, 2 and 3 each serve as the validation set once and each yields a score. Adding the three scores and dividing by 3 gives an overall estimate of the model's performance.
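The rotation described above can be sketched without any library. The helpers below are illustrative only (the names are hypothetical, the folds are contiguous and unshuffled, and the scores are made up); a real `score_fn` would fit a model on the training indices and evaluate it on the validation indices. The order in which the folds take their turn as validation set does not affect the average.

```python
def kfold_indices(n_samples, n_folds):
    """Yield (train, validation) index lists for contiguous, unshuffled folds."""
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, validation
        start += size


def mean_cv_score(n_samples, n_folds, score_fn):
    """Average score_fn(train, validation) over all folds."""
    scores = [score_fn(tr, va) for tr, va in kfold_indices(n_samples, n_folds)]
    return sum(scores) / len(scores)


# Made-up validation scores keyed by the first validation index of each fold.
toy_scores = {0: 0.90, 3: 0.80, 6: 0.85}
mean_score = mean_cv_score(9, 3, lambda train, validation: toy_scores[validation[0]])
print(round(mean_score, 2))
```

Each sample lands in the validation set exactly once, and the final number is the mean of the three per-fold scores.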

def printing_Kfold_scores(x_train_data,y_train_data):
    fold = KFold(5,shuffle=False)
     
    # Different C parameters
    c_param_range = [0.01,0.1,1,10,100]
     
    result_table = pd.DataFrame(index=range(len(c_param_range)),columns=['C_parameter','Mean recall score'])
    result_table['C_parameter'] = c_param_range
     
    # the k-fold will give 2 lists:train_indices=indices[0],test_indices = indices[1]
    j=0  # loop over the candidates to find the best penalty strength
    for c_param in c_param_range:
        print('-------------------------------------------')
        print('C parameter:',c_param)
        print('-------------------------------------------')
        print('')
         
        recall_accs = []
        for iteration,indices in enumerate(fold.split(x_train_data)):
             
            # Call the logistic regression model with a certain C parameter
            # solver='liblinear' supports the L1 penalty (the default lbfgs solver does not)
            # If the model fails to converge, scikit-learn warns:
            #  ConvergenceWarning: Liblinear failed to converge, increase the number of iterations.
            # so max_iter is raised well above its default of 100
            lr = LogisticRegression(C = c_param, penalty='l1', solver='liblinear',max_iter=10000)
            # Use the training data to fit the model. In this case, we use the portion
            # of the fold to train the model with indices[0], We then predict on the
            # portion assigned as the 'test cross validation' with indices[1]
            lr.fit(x_train_data.iloc[indices[0],:],y_train_data.iloc[indices[0],:].values.ravel())
             
            # Predict values using the test indices in the training data
            y_pred_undersample = lr.predict(x_train_data.iloc[indices[1],:].values)
             
            # Calculate the recall score and append it to a list for recall scores
            # representing the current c_parameter
            recall_acc = recall_score(y_train_data.iloc[indices[1],:].values,y_pred_undersample)
            recall_accs.append(recall_acc)
            print('Iteration ',iteration,': recall score = ',recall_acc)
             
        # the mean value of those recall scores is the metric we want to save and get
        # hold of.
        result_table.loc[j,'Mean recall score'] = np.mean(recall_accs)
        j += 1
        print('')
        print('Mean recall score ',np.mean(recall_accs))
        print('')
         
    # Note: the original code raised an error here because it lacked astype('float64')
    best_c = result_table.loc[result_table['Mean recall score'].astype('float64').idxmax()]['C_parameter']
    # Finally, we can check which C parameter is the best amongst the chosen.
    print('*********************************************************************************')
    print('Best model to choose from cross validation is with C parameter',best_c)
    print('*********************************************************************************')
     
    return best_c

Call the function above with the undersampled dataset:

best_c = printing_Kfold_scores(X_train_undersample,y_train_undersample)

Output:

-------------------------------------------
C parameter: 0.01
-------------------------------------------

Iteration  0 : recall score =  0.958904109589041
Iteration  1 : recall score =  0.9178082191780822
Iteration  2 : recall score =  1.0
Iteration  3 : recall score =  0.9864864864864865
Iteration  4 : recall score =  0.9545454545454546

Mean recall score  0.9635488539598128

-------------------------------------------
C parameter: 0.1
-------------------------------------------

Iteration  0 : recall score =  0.8356164383561644
Iteration  1 : recall score =  0.863013698630137
Iteration  2 : recall score =  0.9322033898305084
Iteration  3 : recall score =  0.9459459459459459
Iteration  4 : recall score =  0.8939393939393939

Mean recall score  0.8941437733404299

-------------------------------------------
C parameter: 1
-------------------------------------------

Iteration  0 : recall score =  0.8493150684931506
Iteration  1 : recall score =  0.863013698630137
Iteration  2 : recall score =  0.9830508474576272
Iteration  3 : recall score =  0.9459459459459459
Iteration  4 : recall score =  0.9090909090909091

Mean recall score  0.9100832939235539

-------------------------------------------
C parameter: 10
-------------------------------------------

Iteration  0 : recall score =  0.863013698630137
Iteration  1 : recall score =  0.863013698630137
Iteration  2 : recall score =  0.9830508474576272
Iteration  3 : recall score =  0.9324324324324325
Iteration  4 : recall score =  0.9242424242424242

Mean recall score  0.9131506202785514

-------------------------------------------
C parameter: 100
-------------------------------------------

Iteration  0 : recall score =  0.863013698630137
Iteration  1 : recall score =  0.863013698630137
Iteration  2 : recall score =  0.9830508474576272
Iteration  3 : recall score =  0.9459459459459459
Iteration  4 : recall score =  0.9242424242424242

Mean recall score  0.9158533229812542

*********************************************************************************
Best model to choose from cross validation is with C parameter 0.01
*********************************************************************************

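The results above show that recall is highest when the C parameter is 0.01.

6. Confusion matrix

A confusion matrix is laid out on two axes, x and y, each taking the values 0 and 1. The x axis shows the predicted values and the y axis the true values. It lets us compare the true values with the predictions and compute the evaluation metrics of the current model.

Accuracy here is (136+138)/(136+13+9+138). Recall was given earlier as TP/(TP+FN); in this matrix it looks as follows: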

image11.png
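As a quick arithmetic check, the two metrics can be computed from the four cell counts. The assignment of the counts to cells below is an assumption: it takes the matrix to be laid out as [[136, 13], [9, 138]] with true labels on the rows, which is consistent with the recall of about 93% reported for this model.

```python
# Assumed layout: rows are true labels (0, 1), columns are predicted labels (0, 1).
tn, fp = 136, 13
fn, tp = 9, 138

accuracy = (tp + tn) / (tn + fp + fn + tp)
recall = tp / (tp + fn)

print(round(accuracy, 4))  # 0.9257
print(round(recall, 4))    # 0.9388
```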

Next, define a function that plots the confusion matrix:

def plot_confusion_matrix(cm,
                          classes,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    # This function prints and plots the confusion matrix
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=0)
    plt.yticks(tick_marks, classes)
 
    # typo fixed: "cneter" -> "center"
    thresh = cm.max() / 2
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j,
                 i,
                 cm[i, j],
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")
 
    plt.tight_layout()
    plt.ylabel("True label")
    plt.xlabel("Predicted label")

Now use the best C value found above to plot the confusion matrix for the undersampled dataset.

import itertools
 
lr = LogisticRegression(C=best_c, penalty='l1', solver='liblinear')
lr.fit(X_train_undersample, y_train_undersample.values.ravel())
y_pred_undersample = lr.predict(X_test_undersample.values)
 
# Compute confusion matrix
cnf_matrix = confusion_matrix(y_test_undersample, y_pred_undersample)
np.set_printoptions(precision=2)
 
print("Recall metric in the testing dataset:",
      cnf_matrix[1, 1] / (cnf_matrix[1, 0] + cnf_matrix[1, 1]))
 
# Plot non-normalized confusion.matrix
class_names = [0, 1]
plt.figure()
plot_confusion_matrix(cnf_matrix,
                      classes=class_names,
                      title='Confusion matrix')
plt.show()

image12.png

Recall reaches 93%. However, this test set was itself drawn from the undersampled data, which uses only a small share of the available data.

Next, evaluate on the test set from the original split:

lr = LogisticRegression(C=best_c, penalty='l1', solver='liblinear')
lr.fit(X_train_undersample, y_train_undersample.values.ravel())
y_pred = lr.predict(X_test.values)
 
# Compute confusion matrix
cnf_matrix = confusion_matrix(y_test, y_pred)
np.set_printoptions(precision=2)
 
print("Recall metric in the testing dataset:",
      cnf_matrix[1, 1] / (cnf_matrix[1, 0] + cnf_matrix[1, 1]))
 
# Plot non-normalized confusion matrix
class_names = [0, 1]
plt.figure()
plot_confusion_matrix(cnf_matrix,
                      classes=class_names,
                      title="Confusion matrix")
plt.show()

image13.png

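This time the test set has more than 80,000 samples. The recall holds up reasonably well, but more than 10,000 samples are misclassified, which is rather a lot.

What happens if we train directly on the original dataset? Let us look at the recall when the classes are left imbalanced.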

best_c = printing_Kfold_scores(X_train, y_train)

-------------------------------------------
C parameter: 0.01
-------------------------------------------

Iteration  0 : recall score =  0.4925373134328358
Iteration  1 : recall score =  0.6027397260273972
Iteration  2 : recall score =  0.6833333333333333
Iteration  3 : recall score =  0.5692307692307692
Iteration  4 : recall score =  0.45

Mean recall score  0.5595682284048672

-------------------------------------------
C parameter: 0.1
-------------------------------------------

Iteration  0 : recall score =  0.5671641791044776
Iteration  1 : recall score =  0.6164383561643836
Iteration  2 : recall score =  0.6833333333333333
Iteration  3 : recall score =  0.5846153846153846
Iteration  4 : recall score =  0.525

Mean recall score  0.5953102506435158

-------------------------------------------
C parameter: 1
-------------------------------------------

Iteration  0 : recall score =  0.5522388059701493
Iteration  1 : recall score =  0.6164383561643836
Iteration  2 : recall score =  0.7166666666666667
Iteration  3 : recall score =  0.6153846153846154
Iteration  4 : recall score =  0.5625

Mean recall score  0.612645688837163

-------------------------------------------
C parameter: 10
-------------------------------------------

Iteration  0 : recall score =  0.5522388059701493
Iteration  1 : recall score =  0.6164383561643836
Iteration  2 : recall score =  0.7333333333333333
Iteration  3 : recall score =  0.6153846153846154
Iteration  4 : recall score =  0.575

Mean recall score  0.6184790221704963

-------------------------------------------
C parameter: 100
-------------------------------------------

Iteration  0 : recall score =  0.5522388059701493
Iteration  1 : recall score =  0.6164383561643836
Iteration  2 : recall score =  0.7333333333333333
Iteration  3 : recall score =  0.6153846153846154
Iteration  4 : recall score =  0.575

Mean recall score  0.6184790221704963

*********************************************************************************
Best model to choose from cross validation is with C parameter 10.0
*********************************************************************************

Recall stays at roughly 60%.

Plot the confusion matrix:

lr = LogisticRegression(C=best_c, penalty='l1', solver='liblinear')
lr.fit(X_train, y_train.values.ravel())
# Note: the variable is y_pred_undersample, not x_pred_undersample
y_pred_undersample = lr.predict(X_test.values)
 
# Compute confusion matrix
cnf_matrix = confusion_matrix(y_test, y_pred_undersample)
np.set_printoptions(precision=2)
 
print("Recall metric in the testing dataset",
      cnf_matrix[1, 1] / (cnf_matrix[1, 0] + cnf_matrix[1, 1]))
 
# Plot non-normalized confusion matrix
class_names = [0, 1]
plt.figure()
plot_confusion_matrix(cnf_matrix,
                      classes=class_names,
                      title='Confusion matrix')
plt.show()

image14.png

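So when the classes are imbalanced, training a model directly on the data does not give good results.

In the logistic regression models used so far, samples are classified with a default threshold of 0.5. This raises a question: can we vary the threshold and find out which value gives the best final result?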

lr = LogisticRegression(C=0.01, penalty='l1', solver='liblinear')
lr.fit(X_train_undersample, y_train_undersample.values.ravel())
y_pred_undersample_proba = lr.predict_proba(X_test_undersample.values)  # predicted class probabilities
 
thresholds = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]  # list of thresholds
plt.figure(figsize=(10, 10))
 
j = 1
for i in thresholds:
    y_test_predictions_high_recall = y_pred_undersample_proba[:, 1] > i
    plt.subplot(3, 3, j)
    j += 1
 
    # Compute confusion matrix
    cnf_matrix = confusion_matrix(y_test_undersample,
                                  y_test_predictions_high_recall)
    np.set_printoptions(precision=2)
 
    print("Recall metric in the testing dataset:",
          cnf_matrix[1, 1] / (cnf_matrix[1, 0] + cnf_matrix[1, 1]))
 
    # Plot non-normalized confusion matrix
    class_names = [0, 1]
    plot_confusion_matrix(cnf_matrix,
                          classes=class_names,
                          title='Threshold >= %s' % i)

Recall metric in the testing dataset: 1.0
Recall metric in the testing dataset: 1.0
Recall metric in the testing dataset: 1.0
Recall metric in the testing dataset: 0.9795918367346939
Recall metric in the testing dataset: 0.9387755102040817
Recall metric in the testing dataset: 0.891156462585034
Recall metric in the testing dataset: 0.8367346938775511
Recall metric in the testing dataset: 0.7687074829931972
Recall metric in the testing dataset: 0.5850340136054422

image15.png

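The figure shows what the confusion matrix looks like at each threshold. Weighing accuracy, recall and the number of misclassified samples together, the model performs well at thresholds of 0.5 and 0.6.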
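The trade-off behind the threshold can be shown on a handful of made-up probabilities. The sketch below uses hypothetical helper names and toy values, not the model's real output: raising the threshold lowers recall but also reduces the number of normal transactions wrongly flagged as fraud.

```python
def predict_with_threshold(fraud_probs, threshold):
    """Label a sample as fraud (1) when its fraud probability exceeds the threshold."""
    return [1 if p > threshold else 0 for p in fraud_probs]


def recall_and_false_positives(y_true, y_pred):
    """Return (recall, number of false positives) for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), fp


# Toy data: four fraud cases followed by four normal cases.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
fraud_probs = [0.95, 0.72, 0.55, 0.35, 0.65, 0.30, 0.15, 0.05]

results = {}
for threshold in (0.2, 0.5, 0.8):
    preds = predict_with_threshold(fraud_probs, threshold)
    results[threshold] = recall_and_false_positives(y_true, preds)
    print(threshold, results[threshold])
```

On this toy data a threshold of 0.2 catches every fraud case but flags two normal transactions, while 0.8 flags none and misses three of the four fraud cases.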

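7. Oversampling

Oversampling with the SMOTE algorithm:

(1) For each sample x in the minority class, compute its Euclidean distance to every other sample in the minority class and find its k nearest neighbors.
(2) Set a sampling ratio based on the class imbalance to determine the sampling rate N. For each minority sample x, randomly pick several samples from its k nearest neighbors; call a chosen neighbor xn.
(3) For each randomly chosen neighbor xn, build a new sample from it and the original sample using the following formula.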

image16.png
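Step (3) amounts to picking a random point on the line segment between x and its neighbor xn. The sketch below assumes the standard SMOTE interpolation, x_new = x + rand(0, 1) * (xn - x); the function name and the two sample points are made up for illustration.

```python
import random


def smote_point(x, neighbor, rng):
    """Return a synthetic sample on the segment between x and one of its neighbors."""
    gap = rng.random()  # uniform in [0, 1)
    return [xi + gap * (ni - xi) for xi, ni in zip(x, neighbor)]


rng = random.Random(0)
x = [1.0, 2.0]    # hypothetical minority sample
xn = [3.0, 6.0]   # hypothetical nearest neighbor of x
new_point = smote_point(x, xn, rng)
print(new_point)
```

Because the same random factor is applied to every feature, the synthetic sample always lies between the two real ones rather than being an exact copy of either.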

Import the required Python libraries:

import pandas as pd
from imblearn.over_sampling import SMOTE  # pip install imbalanced-learn
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

Get the feature and label data:

credit_cards = pd.read_csv('creditcard.csv')
 
columns = credit_cards.columns
# The labels are in the last column ('Class'). Simply remove it to obtain features columns
features_columns = columns.delete(len(columns) - 1)
 
features = credit_cards[features_columns]
labels = credit_cards['Class']

Split into training and test sets:

features_train, features_test, labels_train, labels_test = train_test_split(
    features, labels, test_size=0.2, random_state=0)

Build the oversampled dataset with SMOTE:

oversampler = SMOTE(random_state=0)
os_features,os_labels = oversampler.fit_resample(features_train,labels_train)  # os = oversampled

Check the size of the oversampled dataset:

len(os_labels[os_labels==1])
# gives 227454 samples

Now run cross-validation and fit the logistic regression model on the oversampled dataset:

os_features = pd.DataFrame(os_features)
os_labels = pd.DataFrame(os_labels)
best_c = printing_Kfold_scores(os_features, os_labels)

-------------------------------------------
C parameter: 0.01
-------------------------------------------

Iteration  0 : recall score =  0.8903225806451613
Iteration  1 : recall score =  0.8947368421052632
Iteration  2 : recall score =  0.9687728228394379
Iteration  3 : recall score =  0.9578813158791396
Iteration  4 : recall score =  0.958167089831943

Mean recall score  0.933976130260189

-------------------------------------------
C parameter: 0.1
-------------------------------------------

Iteration  0 : recall score =  0.8903225806451613
Iteration  1 : recall score =  0.8947368421052632
Iteration  2 : recall score =  0.9703884032311608
Iteration  3 : recall score =  0.9593981160901727
Iteration  4 : recall score =  0.9605082379837548

Mean recall score  0.9350708360111024

-------------------------------------------
C parameter: 1
-------------------------------------------

Iteration  0 : recall score =  0.8903225806451613
Iteration  1 : recall score =  0.8947368421052632
Iteration  2 : recall score =  0.9704105344694036
Iteration  3 : recall score =  0.9585847594552709
Iteration  4 : recall score =  0.9595410030665743

Mean recall score  0.9347191439483347

-------------------------------------------
C parameter: 10
-------------------------------------------

Iteration  0 : recall score =  0.8903225806451613
Iteration  1 : recall score =  0.8947368421052632
Iteration  2 : recall score =  0.9705433218988603
Iteration  3 : recall score =  0.9601894901133203
Iteration  4 : recall score =  0.9604862553720007

Mean recall score  0.9352556980269211

-------------------------------------------
C parameter: 100
-------------------------------------------

Iteration  0 : recall score =  0.8903225806451613
Iteration  1 : recall score =  0.8947368421052632
Iteration  2 : recall score =  0.9703220095164324
Iteration  3 : recall score =  0.9604093162308613
Iteration  4 : recall score =  0.9607170727954188

Mean recall score  0.9353015642586275

*********************************************************************************
Best model to choose from cross validation is with C parameter 100.0
*********************************************************************************

Now look at the confusion matrix:

lr = LogisticRegression(C = best_c, penalty = 'l1', solver='liblinear')
lr.fit(os_features,os_labels.values.ravel())
y_pred = lr.predict(features_test.values)
 
# Compute confusion matrix
cnf_matrix = confusion_matrix(labels_test,y_pred)
np.set_printoptions(precision=2)
 
print("Recall metric in the testing dataset: ", cnf_matrix[1,1]/(cnf_matrix[1,0]+cnf_matrix[1,1]))
 
# Plot non-normalized confusion matrix
class_names = [0,1]
plt.figure()
plot_confusion_matrix(cnf_matrix
                      , classes=class_names
                      , title='Confusion matrix')
plt.show()

image17.png

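Weighing accuracy, recall and the number of misclassified samples together, oversampling turns out to work somewhat better than undersampling.

8. Summary

With imbalanced data, the more of the data that can be used, the better. Undersampling produces a large number of false positives, and this is inherent to the approach: the model is trained on equally few samples of class 0 and class 1, so it behaves as if the two classes were equally rare in the original data, which pushes the number of misclassified normal transactions up. In this case study, oversampling gives the better result: recall is slightly lower, but the overall performance is good.

Through this credit card fraud detection case, we covered how to handle imbalanced class distributions in machine learning, along with cross-validation, regularization penalties, the confusion matrix and model evaluation methods.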