Logistic Regression Case Study

Handling imbalanced samples (undersampling and oversampling)
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

%matplotlib inline
data = pd.read_csv("creditcard.csv")
data.head()
count_classes = data['Class'].value_counts(sort = True).sort_index()
count_classes.plot(kind = 'bar')
plt.title("Fraud class histogram")
plt.xlabel("Class")
plt.ylabel("Frequency")
plt.show()
[Figure: bar chart "Fraud class histogram" (x-axis: Class, y-axis: Frequency)]
from sklearn.preprocessing import StandardScaler

data['normAmount'] = StandardScaler().fit_transform(data['Amount'].values.reshape(-1, 1))
data = data.drop(['Time','Amount'],axis=1)
data.head()
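The scaling applied to the Amount column can be reproduced by hand: subtract the column mean and divide by the population standard deviation (ddof=0), which is what StandardScaler computes. A minimal sketch on made-up amounts; the values and the helper name `standardize` are illustrative assumptions, not taken from creditcard.csv.

```python
import numpy as np

# Hypothetical transaction amounts (not taken from creditcard.csv).
amounts = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

def standardize(values):
    """Rescale to zero mean and unit variance (population std, ddof=0)."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / values.std()

norm_amounts = standardize(amounts)
print(norm_amounts.mean(), norm_amounts.std())
```

After scaling, the single large amount no longer dominates the feature range, which matters for a regularized linear model such as logistic regression.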
X = data.loc[:, data.columns != 'Class']
y = data.loc[:, data.columns == 'Class']

# Number of data points in the minority class
number_records_fraud = len(data[data.Class == 1])
fraud_indices = np.array(data[data.Class == 1].index)

# Picking the indices of the normal classes
normal_indices = data[data.Class == 0].index

# Out of the indices we picked, randomly select "x" number (number_records_fraud)
random_normal_indices = np.random.choice(normal_indices, number_records_fraud, replace = False)
random_normal_indices = np.array(random_normal_indices)

# Appending the 2 indices
under_sample_indices = np.concatenate([fraud_indices,random_normal_indices])

# Under sample dataset
under_sample_data = data.iloc[under_sample_indices,:]

X_undersample = under_sample_data.loc[:, under_sample_data.columns != 'Class']
y_undersample = under_sample_data.loc[:, under_sample_data.columns == 'Class']

# Showing ratio
print("Percentage of normal transactions: ", len(under_sample_data[under_sample_data.Class == 0])/len(under_sample_data))
print("Percentage of fraud transactions: ", len(under_sample_data[under_sample_data.Class == 1])/len(under_sample_data))
print("Total number of transactions in resampled data: ", len(under_sample_data))
Percentage of normal transactions:  0.5
Percentage of fraud transactions:  0.5
Total number of transactions in resampled data:  984
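The undersampling steps above can be condensed into one small function. This is a sketch on toy labels, assuming class 1 is the minority; the helper name `undersample_indices` and the seed are hypothetical and not part of the original notebook.

```python
import numpy as np

def undersample_indices(labels, minority=1, seed=0):
    """Return positions of all minority rows plus an equal-sized
    random sample (without replacement) of the remaining rows."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    minority_idx = np.flatnonzero(labels == minority)
    majority_idx = np.flatnonzero(labels != minority)
    sampled_majority = rng.choice(majority_idx, size=len(minority_idx), replace=False)
    return np.concatenate([minority_idx, sampled_majority])

# Toy labels: 95 normal (0) and 5 fraud (1) rows.
toy_labels = np.array([0] * 95 + [1] * 5)
idx = undersample_indices(toy_labels)
print(len(idx), toy_labels[idx].mean())
```

Fixing the seed makes the sample reproducible; the original code calls np.random.choice without a seed, so its 492 normal rows differ from run to run.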
from sklearn.model_selection import train_test_split

# Whole dataset
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size = 0.3, random_state = 0)

print("Number transactions train dataset: ", len(X_train))
print("Number transactions test dataset: ", len(X_test))
print("Total number of transactions: ", len(X_train)+len(X_test))

# Undersampled dataset
X_train_undersample, X_test_undersample, y_train_undersample, y_test_undersample = train_test_split(X_undersample
                                                                                                   ,y_undersample
                                                                                                   ,test_size = 0.3
                                                                                                   ,random_state = 0)
print("")
print("Number transactions train dataset: ", len(X_train_undersample))
print("Number transactions test dataset: ", len(X_test_undersample))
print("Total number of transactions: ", len(X_train_undersample)+len(X_test_undersample))
Number transactions train dataset:  199364
Number transactions test dataset:  85443
Total number of transactions:  284807

Number transactions train dataset:  688
Number transactions test dataset:  296
Total number of transactions:  984
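The splits above are purely random, so the share of fraud rows in the train and test parts can drift when positives are rare. One variation, not used in the original code, is to pass stratify to train_test_split so both parts keep the class ratio. A sketch on toy data (the arrays are made up for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 1000 rows with 2% positives (illustrative only).
X_toy = np.arange(1000).reshape(-1, 1)
y_toy = np.array([1] * 20 + [0] * 980)

# stratify=y_toy keeps the 2% positive rate in both parts of the split.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=0, stratify=y_toy
)
print(y_tr.mean(), y_te.mean())
```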
# Recall = TP / (TP + FN)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.metrics import confusion_matrix,recall_score,classification_report
def printing_Kfold_scores(x_train_data,y_train_data):
    # KFold in model_selection takes the number of splits; the indices come from .split()
    fold = KFold(n_splits=5,shuffle=False)

    # Different C parameters
    c_param_range = [0.01,0.1,1,10,100]

    results_table = pd.DataFrame(index = range(len(c_param_range)), columns = ['C_parameter','Mean recall score'])
    results_table['C_parameter'] = c_param_range

    # each split yields a pair: train_indices = indices[0], test_indices = indices[1]
    j = 0
    for c_param in c_param_range:
        print('-------------------------------------------')
        print('C parameter: ', c_param)
        print('-------------------------------------------')
        print('')

        recall_accs = []
        for iteration, indices in enumerate(fold.split(x_train_data),start=1):

            # Call the logistic regression model with a certain C parameter
            # (the L1 penalty needs a solver that supports it, such as liblinear)
            lr = LogisticRegression(C = c_param, penalty = 'l1', solver = 'liblinear')

            # Use the training data to fit the model. In this case, we use the portion of the fold to train the model
            # with indices[0]. We then predict on the portion assigned as the 'test cross validation' with indices[1]
            lr.fit(x_train_data.iloc[indices[0],:].values,y_train_data.iloc[indices[0],:].values.ravel())

            # Predict values using the test indices in the training data
            y_pred_undersample = lr.predict(x_train_data.iloc[indices[1],:].values)

            # Calculate the recall score and append it to a list for recall scores representing the current c_parameter
            recall_acc = recall_score(y_train_data.iloc[indices[1],:].values,y_pred_undersample)
            recall_accs.append(recall_acc)
            print('Iteration ', iteration,': recall score = ', recall_acc)

        # The mean value of those recall scores is the metric we want to save and get hold of.
        results_table.loc[j,'Mean recall score'] = np.mean(recall_accs)
        j += 1
        print('')
        print('Mean recall score ', np.mean(recall_accs))
        print('')

    # The score column starts out with object dtype; cast to float so idxmax works
    results_table['Mean recall score'] = results_table['Mean recall score'].astype(float)
    best_c = results_table.loc[results_table['Mean recall score'].idxmax()]['C_parameter']
    
    # Finally, we can check which C parameter is the best amongst the chosen.
    print('*********************************************************************************')
    print('Best model to choose from cross validation is with C parameter = ', best_c)
    print('*********************************************************************************')
    
    return best_c
best_c = printing_Kfold_scores(X_train_undersample,y_train_undersample)
-------------------------------------------
C parameter:  0.01
-------------------------------------------

Iteration  1 : recall score =  0.958904109589
Iteration  2 : recall score =  0.917808219178
Iteration  3 : recall score =  1.0
Iteration  4 : recall score =  0.972972972973
Iteration  5 : recall score =  0.954545454545

Mean recall score  0.960846151257

-------------------------------------------
C parameter:  0.1
-------------------------------------------

Iteration  1 : recall score =  0.835616438356
Iteration  2 : recall score =  0.86301369863
Iteration  3 : recall score =  0.915254237288
Iteration  4 : recall score =  0.932432432432
Iteration  5 : recall score =  0.878787878788

Mean recall score  0.885020937099

-------------------------------------------
C parameter:  1
-------------------------------------------

Iteration  1 : recall score =  0.835616438356
Iteration  2 : recall score =  0.86301369863
Iteration  3 : recall score =  0.966101694915
Iteration  4 : recall score =  0.945945945946
Iteration  5 : recall score =  0.893939393939

Mean recall score  0.900923434357

-------------------------------------------
C parameter:  10
-------------------------------------------

Iteration  1 : recall score =  0.849315068493
Iteration  2 : recall score =  0.86301369863
Iteration  3 : recall score =  0.966101694915
Iteration  4 : recall score =  0.959459459459
Iteration  5 : recall score =  0.893939393939

Mean recall score  0.906365863087

-------------------------------------------
C parameter:  100
-------------------------------------------

Iteration  1 : recall score =  0.86301369863
Iteration  2 : recall score =  0.86301369863
Iteration  3 : recall score =  0.966101694915
Iteration  4 : recall score =  0.959459459459
Iteration  5 : recall score =  0.893939393939

Mean recall score  0.909105589115

*********************************************************************************
Best model to choose from cross validation is with C parameter =  0.01
*********************************************************************************
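The same search over C can be written more compactly with cross_val_score and scoring="recall". The sketch below runs on synthetic data from make_classification as a stand-in for the undersampled training set, and it uses a shuffled StratifiedKFold rather than the unshuffled KFold of the function above, so it is a variation and not a reproduction of the scores printed here. The names `X_demo`, `y_demo`, `mean_recalls` and `best_c_demo` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, roughly balanced data standing in for the undersampled training set.
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)

# Stratified folds keep the class ratio equal in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

mean_recalls = {}
for c_param in [0.01, 0.1, 1, 10, 100]:
    model = LogisticRegression(C=c_param, penalty="l1", solver="liblinear")
    scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="recall")
    mean_recalls[c_param] = scores.mean()

# Pick the C value with the highest mean recall across folds.
best_c_demo = max(mean_recalls, key=mean_recalls.get)
print(mean_recalls, best_c_demo)
```

Selecting C on recall alone ignores false positives, so the chosen value should be checked against the confusion matrix as the following cells do.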
import itertools

def plot_confusion_matrix(cm, classes,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    """
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=0)
    plt.yticks(tick_marks, classes)

    # Write each count in its cell, in white on dark cells and black on light ones
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j],
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
lr = LogisticRegression(C = best_c, penalty = 'l1', solver = 'liblinear')
lr.fit(X_train_undersample,y_train_undersample.values.ravel())
y_pred_undersample = lr.predict(X_test_undersample.values)

# Compute confusion matrix
cnf_matrix = confusion_matrix(y_test_undersample,y_pred_undersample)
np.set_printoptions(precision=2)

print("Recall metric in the testing dataset: ", cnf_matrix[1,1]/(cnf_matrix[1,0]+cnf_matrix[1,1]))

# Plot non-normalized confusion matrix
class_names = [0,1]
plt.figure()
plot_confusion_matrix(cnf_matrix
                      , classes=class_names
                      , title='Confusion matrix')
plt.show()
Recall metric in the testing dataset:  0.931972789116
[Figure: confusion matrix on the undersampled test set]
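The recall printed above is read off the confusion matrix as cnf_matrix[1,1]/(cnf_matrix[1,0]+cnf_matrix[1,1]), because scikit-learn puts true labels on rows and predicted labels on columns. A small worked example with made-up labels shows the same arithmetic, Recall = TP/(TP+FN), without any library; the arrays and the helper `confusion_counts` are hypothetical.

```python
import numpy as np

# Hypothetical labels and predictions (1 = fraud, 0 = normal).
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_hat = np.array([1, 1, 1, 0, 0, 0, 0, 0, 1, 0])

def confusion_counts(truth, predicted):
    """Return (tn, fp, fn, tp) for binary 0/1 labels."""
    tp = int(np.sum((truth == 1) & (predicted == 1)))
    fn = int(np.sum((truth == 1) & (predicted == 0)))
    fp = int(np.sum((truth == 0) & (predicted == 1)))
    tn = int(np.sum((truth == 0) & (predicted == 0)))
    return tn, fp, fn, tp

tn, fp, fn, tp = confusion_counts(y_true, y_hat)
recall = tp / (tp + fn)
print(tn, fp, fn, tp, recall)  # 5 1 1 3 0.75
```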
lr = LogisticRegression(C = best_c, penalty = 'l1', solver = 'liblinear')
lr.fit(X_train_undersample,y_train_undersample.values.ravel())
y_pred = lr.predict(X_test.values)

# Compute confusion matrix
cnf_matrix = confusion_matrix(y_test,y_pred)
np.set_printoptions(precision=2)

print("Recall metric in the testing dataset: ", cnf_matrix[1,1]/(cnf_matrix[1,0]+cnf_matrix[1,1]))

# Plot non-normalized confusion matrix
class_names = [0,1]
plt.figure()
plot_confusion_matrix(cnf_matrix
                      , classes=class_names
                      , title='Confusion matrix')
plt.show()
Recall metric in the testing dataset:  0.918367346939
[Figure: confusion matrix on the full test set, model trained on undersampled data]
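A high recall on the full test set does not say how many normal transactions were flagged by mistake. The sketch below computes precision next to recall from confusion-matrix counts. The counts are entirely hypothetical and are not the values from the figure above; they only illustrate how a model can catch most fraud while most of its alerts are false alarms.

```python
def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP), Recall = TP / (TP + FN)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall

# Hypothetical counts for a heavily imbalanced test set (not from the notebook).
tp_demo, fn_demo = 90, 10      # 100 fraud rows, 90 of them caught
fp_demo = 900                  # normal rows wrongly flagged as fraud

precision_demo, recall_demo = precision_recall(tp_demo, fp_demo, fn_demo)
print(round(precision_demo, 4), recall_demo)  # 0.0909 0.9
```

With these assumed counts, nine out of ten fraud cases are found, yet only about one alert in eleven is real fraud.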
best_c = printing_Kfold_scores(X_train,y_train)
-------------------------------------------
C parameter:  0.01
-------------------------------------------

Iteration  1 : recall score =  0.492537313433
Iteration  2 : recall score =  0.602739726027
Iteration  3 : recall score =  0.683333333333
Iteration  4 : recall score =  0.569230769231
Iteration  5 : recall score =  0.45

Mean recall score  0.559568228405

-------------------------------------------
C parameter:  0.1
-------------------------------------------

Iteration  1 : recall score =  0.567164179104
Iteration  2 : recall score =  0.616438356164
Iteration  3 : recall score =  0.683333333333
Iteration  4 : recall score =  0.584615384615
Iteration  5 : recall score =  0.525

Mean recall score  0.595310250644

-------------------------------------------
C parameter:  1
-------------------------------------------

Iteration  1 : recall score =  0.55223880597
Iteration  2 : recall score =  0.616438356164
Iteration  3 : recall score =  0.716666666667
Iteration  4 : recall score =  0.615384615385
Iteration  5 : recall score =  0.5625

Mean recall score  0.612645688837

-------------------------------------------
C parameter:  10
-------------------------------------------

Iteration  1 : recall score =  0.55223880597
Iteration  2 : recall score =  0.616438356164
Iteration  3 : recall score =  0.733333333333
Iteration  4 : recall score =  0.615384615385
Iteration  5 : recall score =  0.575

Mean recall score  0.61847902217

-------------------------------------------
C parameter:  100
-------------------------------------------

Iteration  1 : recall score =  0.55223880597
Iteration  2 : recall score =  0.616438356164
Iteration  3 : recall score =  0.733333333333
Iteration  4 : recall score =  0.615384615385
Iteration  5 : recall score =  0.575

Mean recall score  0.61847902217

*********************************************************************************
Best model to choose from cross validation is with C parameter =  10.0
*********************************************************************************
lr = LogisticRegression(C = best_c, penalty = 'l1', solver = 'liblinear')
lr.fit(X_train,y_train.values.ravel())
y_pred = lr.predict(X_test.values)

# Compute confusion matrix
cnf_matrix = confusion_matrix(y_test,y_pred)
np.set_printoptions(precision=2)

print("Recall metric in the testing dataset: ", cnf_matrix[1,1]/(cnf_matrix[1,0]+cnf_matrix[1,1]))

# Plot non-normalized confusion matrix
class_names = [0,1]
plt.figure()
plot_confusion_matrix(cnf_matrix
                      , classes=class_names
                      , title='Confusion matrix')
plt.show()
Recall metric in the testing dataset:  0.619047619048
[Figure: confusion matrix on the full test set, model trained on the full training set]
lr = LogisticRegression(C = 0.01, penalty = 'l1', solver = 'liblinear')
lr.fit(X_train_undersample,y_train_undersample.values.ravel())
y_pred_undersample_proba = lr.predict_proba(X_test_undersample.values)

thresholds = [0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9]

plt.figure(figsize=(10,10))

j = 1
for i in thresholds:
    y_test_predictions_high_recall = y_pred_undersample_proba[:,1] > i
    
    plt.subplot(3,3,j)
    j += 1
    
    # Compute confusion matrix
    cnf_matrix = confusion_matrix(y_test_undersample,y_test_predictions_high_recall)
    np.set_printoptions(precision=2)

    print("Recall metric in the testing dataset: ", cnf_matrix[1,1]/(cnf_matrix[1,0]+cnf_matrix[1,1]))

    # Plot non-normalized confusion matrix
    class_names = [0,1]
    plot_confusion_matrix(cnf_matrix
                          , classes=class_names
                          , title='Threshold > %s'%i)
Recall metric in the testing dataset:  1.0
Recall metric in the testing dataset:  1.0
Recall metric in the testing dataset:  1.0
Recall metric in the testing dataset:  0.986394557823
Recall metric in the testing dataset:  0.931972789116
Recall metric in the testing dataset:  0.884353741497
Recall metric in the testing dataset:  0.836734693878
Recall metric in the testing dataset:  0.748299319728
Recall metric in the testing dataset:  0.571428571429
[Figure: 3x3 grid of confusion matrices, one per threshold from 0.1 to 0.9]
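The pattern in the recall values above (1.0 at low thresholds, falling as the threshold rises) follows directly from how the cut-off works: a higher threshold labels fewer rows as fraud, so recall can only stay the same or drop. A minimal sketch with made-up probabilities; `proba`, `truth` and `recall_at` are hypothetical names and values.

```python
import numpy as np

# Hypothetical predicted fraud probabilities and true labels.
proba = np.array([0.95, 0.80, 0.65, 0.55, 0.45, 0.30, 0.20, 0.10])
truth = np.array([1, 1, 1, 0, 1, 0, 0, 0])

def recall_at(threshold):
    """Recall when a row is flagged as fraud if its probability exceeds threshold."""
    predicted = proba > threshold
    tp = np.sum(predicted & (truth == 1))
    fn = np.sum(~predicted & (truth == 1))
    return tp / (tp + fn)

recalls = [recall_at(t) for t in (0.1, 0.3, 0.5, 0.7, 0.9)]
print(recalls)  # recall falls from 1.0 to 0.25 as the threshold rises
```

The price of a low threshold is more false positives, which is why the confusion matrices are inspected for each threshold instead of picking the one with the highest recall.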

import pandas as pd
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
credit_cards=pd.read_csv('creditcard.csv')

columns=credit_cards.columns
# The labels are in the last column ('Class'). Simply remove it to obtain features columns
features_columns=columns.delete(len(columns)-1)

features=credit_cards[features_columns]
labels=credit_cards['Class']
features_train, features_test, labels_train, labels_test = train_test_split(features, 
                                                                            labels, 
                                                                            test_size=0.2, 
                                                                            random_state=0)
# Handle the imbalanced samples by oversampling (SMOTE)
oversampler=SMOTE(random_state=0)
os_features,os_labels=oversampler.fit_resample(features_train,labels_train)
len(os_labels[os_labels==1])
227454
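SMOTE raises the fraud count to match the normal count by creating synthetic rows, each placed on the line segment between a minority row and one of its nearest minority neighbours. The sketch below illustrates that interpolation idea with NumPy only. It is a simplified stand-in, not the imblearn implementation, and the function name `smote_like`, the value of k and the toy points are assumptions.

```python
import numpy as np

def smote_like(minority, n_new, k=3, seed=0):
    """Create n_new synthetic rows by interpolating between a random minority
    row and one of its k nearest minority neighbours (simplified SMOTE idea)."""
    minority = np.asarray(minority, dtype=float)
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances between minority rows.
    dists = np.linalg.norm(minority[:, None, :] - minority[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a row is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]
    synthetic = np.empty((n_new, minority.shape[1]))
    for row in range(n_new):
        i = rng.integers(len(minority))
        j = neighbours[i, rng.integers(k)]
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic[row] = minority[i] + gap * (minority[j] - minority[i])
    return synthetic

# Four toy minority points at the corners of the unit square.
toy_minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
new_rows = smote_like(toy_minority, n_new=6)
print(new_rows.shape)
```

Because every synthetic row lies between two existing minority rows, the new points stay inside the region the minority class already occupies.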
os_features = pd.DataFrame(os_features)
os_labels = pd.DataFrame(os_labels)
best_c = printing_Kfold_scores(os_features,os_labels)
-------------------------------------------
C parameter:  0.01
-------------------------------------------

Iteration  1 : recall score =  0.890322580645
Iteration  2 : recall score =  0.894736842105
Iteration  3 : recall score =  0.968861347792
Iteration  4 : recall score =  0.957595541926
Iteration  5 : recall score =  0.958430881173

Mean recall score  0.933989438728

-------------------------------------------
C parameter:  0.1
-------------------------------------------

Iteration  1 : recall score =  0.890322580645
Iteration  2 : recall score =  0.894736842105
Iteration  3 : recall score =  0.970410534469
Iteration  4 : recall score =  0.959980655302
Iteration  5 : recall score =  0.960178498807

Mean recall score  0.935125822266

-------------------------------------------
C parameter:  1
-------------------------------------------

Iteration  1 : recall score =  0.890322580645
Iteration  2 : recall score =  0.894736842105
Iteration  3 : recall score =  0.970454796946
Iteration  4 : recall score =  0.96014552489
Iteration  5 : recall score =  0.960596168431

Mean recall score  0.935251182603

-------------------------------------------
C parameter:  10
-------------------------------------------

Iteration  1 : recall score =  0.890322580645
Iteration  2 : recall score =  0.894736842105
Iteration  3 : recall score =  0.97065397809
Iteration  4 : recall score =  0.960343368396
Iteration  5 : recall score =  0.960530220596

Mean recall score  0.935317397966

-------------------------------------------
C parameter:  100
-------------------------------------------

Iteration  1 : recall score =  0.890322580645
Iteration  2 : recall score =  0.894736842105
Iteration  3 : recall score =  0.970543321899
Iteration  4 : recall score =  0.960211472725
Iteration  5 : recall score =  0.960903924995

Mean recall score  0.935343628474

*********************************************************************************
Best model to choose from cross validation is with C parameter =  100.0
*********************************************************************************
lr = LogisticRegression(C = best_c, penalty = 'l1', solver = 'liblinear')
lr.fit(os_features,os_labels.values.ravel())
y_pred = lr.predict(features_test.values)

# Compute confusion matrix
cnf_matrix = confusion_matrix(labels_test,y_pred)
np.set_printoptions(precision=2)

print("Recall metric in the testing dataset: ", cnf_matrix[1,1]/(cnf_matrix[1,0]+cnf_matrix[1,1]))

# Plot non-normalized confusion matrix
class_names = [0,1]
plt.figure()
plot_confusion_matrix(cnf_matrix
                      , classes=class_names
                      , title='Confusion matrix')
plt.show()
Recall metric in the testing dataset:  0.90099009901
[Figure: confusion matrix on the test set, model trained on SMOTE-oversampled data]
