From Linear Regression to Logistic Regression

Binary classification with logistic regression

  • Probability distribution
  • The response value represents a probability, in the range [0, 1]

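1 . Ordinary linear regression assumes that the response variable is normally distributed. The normal distribution is also called the Gaussian distribution or the bell curve.
2 . If the response variable is not normally distributed but instead represents the probability of an event, that assumption does not hold.
3 . Generalized linear models use a link function to describe the relationship between the explanatory variables and the response variable.
4 . Ordinary linear regression is a special case of the generalized linear model that uses the identity link function: it relates a linear combination of the explanatory variables to a normally distributed response variable.
5 . In logistic regression, if the response variable exceeds some threshold, the prediction is the positive class; otherwise it is the negative class.
6 . The response variable is modeled as a function of a linear combination of the explanatory variables using the logistic function. The logistic function always returns a value between 0 and 1: F(t) = 1 / (1 + e^(-t))
7 . For the logistic function, t is equal to a linear combination of the explanatory variables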
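
The logistic function and the thresholding step described above can be sketched in a few lines of plain Python. This is only an illustration, not scikit-learn's implementation: the names `logistic` and `predict_label`, the weights, the intercept, and the 0.5 threshold are all assumptions made for the example.

```python
import math


def logistic(t):
    """Map any real number t to a value strictly between 0 and 1."""
    # Fine for moderate t; very large negative t would overflow math.exp.
    return 1.0 / (1.0 + math.exp(-t))


def predict_label(features, weights, intercept, threshold=0.5):
    """Return (label, probability) for one instance.

    t is a linear combination of the explanatory variables; the label is
    1 (positive) when the probability exceeds the threshold, otherwise 0.
    """
    t = intercept + sum(w * x for w, x in zip(weights, features))
    probability = logistic(t)
    label = 1 if probability > threshold else 0
    return label, probability


# Hypothetical weights and features, chosen only for illustration.
label, probability = predict_label([2.0, 1.0], weights=[1.5, -0.5], intercept=-1.0)
print(label, round(probability, 3))
```

Here t = -1.0 + 1.5 * 2.0 - 0.5 * 1.0 = 1.5, so the probability is above 0.5 and the instance is labeled positive.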

Spam filtering (SMS spam messages)

1 . Explore the data and calculate some basic summary statistics using pandas

import pandas as pd

# Tab-separated file with no header row: column 0 is the label, column 1 is the message text
df = pd.read_csv('/Users/enniu/Desktop/SMSSpamCollection', sep='\t', header=None)
print('Number of spam messages:', df[df[0] == 'spam'][0].count())
print('Number of ham messages:', df[df[0] == 'ham'][0].count())

2 . Create a TfidfVectorizer, fit it on the training messages, and transform both the training and test messages

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
# sklearn.cross_validation was removed from scikit-learn; train_test_split now lives in model_selection
from sklearn.model_selection import train_test_split

df = pd.read_csv('/Users/enniu/Desktop/SMSSpamCollection', sep='\t', header=None)
# By default 25% of the rows go to the test set; the raw splits are pandas Series
X_train_raw, X_test_raw, y_train, y_test = train_test_split(df[1], df[0])
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(X_train_raw)  # learn the vocabulary and build the TF-IDF matrix
X_test = vectorizer.transform(X_test_raw)        # result is a SciPy sparse matrix

3 . Create an instance of LogisticRegression and train the model

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

df = pd.read_csv('/Users/enniu/Desktop/SMSSpamCollection', sep='\t', header=None)
# By default 25% of the rows go to the test set; the raw splits are pandas Series
X_train_raw, X_test_raw, y_train, y_test = train_test_split(df[1], df[0])
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(X_train_raw)  # learn the vocabulary and build the TF-IDF matrix
X_test = vectorizer.transform(X_test_raw)        # result is a SciPy sparse matrix
classifier = LogisticRegression()
classifier.fit(X_train, y_train)
predictions = classifier.predict(X_test)
for i, prediction in enumerate(predictions[:5]):
    # Use position-based indexing (iloc) here. X_test_raw[i] looks rows up by index label,
    # and the split shuffles the rows while keeping their original integer labels,
    # so label i is usually missing from the test set and the lookup raises a KeyError.
    print('Prediction: %s. True label: %s. Message: %s' % (prediction, y_test.iloc[i], X_test_raw.iloc[i]))

Binary classification performance metrics

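                   Predicted positive   Predicted negative
Actual positive    True Positive        False Negative
Actual negative    False Positive       True Negative

When the code below is run, scikit-learn orders the labels so that the positive class (1) comes last:

            Predicted 0       Predicted 1
Actual 0    True Negative     False Positive
Actual 1    False Negative    True Positive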
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt

y_test = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 1, 0, 0, 0, 0, 0, 1, 1, 1]
# Store the result under a different name so the confusion_matrix function is not shadowed
cm = confusion_matrix(y_test, y_pred)
print(cm)
plt.matshow(cm)
plt.title('Confusion matrix')
plt.colorbar()
plt.ylabel('True label')
plt.xlabel('Predicted label')
plt.show()

Accuracy

  • Accuracy measures the fraction of the classifier's predictions that are correct
from sklearn.metrics import accuracy_score

y_pred = [0, 1, 1, 0]
y_true = [1, 1, 1, 1]
print('Accuracy:', accuracy_score(y_true, y_pred))  # 2 of 4 predictions are correct, so the result is 0.5
  • Evaluate the classifier's accuracy
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score

df = pd.read_csv('/Users/enniu/Desktop/sms.csv')
X_train_raw, X_test_raw, y_train, y_test = train_test_split(df['message'], df['label'])
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(X_train_raw)
X_test = vectorizer.transform(X_test_raw)
classifier = LogisticRegression()
classifier.fit(X_train, y_train)
# 5-fold cross-validation on the training set; the default score for a classifier is accuracy
scores = cross_val_score(classifier, X_train, y_train, cv=5)
# y_pre = classifier.predict(X_test)
# for i, pre in enumerate(y_pre[:5]):
#     print(y_pre[i], y_test.iloc[i], X_test_raw.iloc[i])
print('Accuracy', np.mean(scores), scores)
# Outcome: Accuracy 0.955980861244 [ 0.94976077  0.95933014  0.96052632  0.96291866  0.94736842]
  • Drawbacks
    1 . Accuracy can't distinguish between false positive errors and false negative errors
    2 . Accuracy is not an informative metric if the proportions of the classes are skewed in the population
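
The second drawback is easy to see on a made-up, heavily skewed data set. The sketch below is plain Python with invented numbers (5 spam messages out of 100) and a hypothetical "classifier" that labels everything as ham; the `accuracy` helper is written for this example and is not a library function.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Invented, skewed data: 5 spam messages (1) and 95 ham messages (0).
y_true = [1] * 5 + [0] * 95
# A useless classifier that predicts ham for every message.
y_pred = [0] * 100

acc = accuracy(y_true, y_pred)
spam_caught = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
print('Accuracy:', acc)            # 95 of 100 predictions are correct
print('Spam caught:', spam_caught)  # yet no spam message was detected
```

The accuracy is 0.95 even though the classifier never finds a single spam message, which is why precision and recall are worth reporting alongside it.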

Precision and recall

  • Definition
  1. Precision is the fraction of positive predictions that are correct: P = TP / (TP + FP)
  2. Recall is the fraction of truly positive instances that the classifier recognizes: R = TP / (TP + FN)

  • Calculate the SMS classifier's precision and recall
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score

df = pd.read_csv('/Users/enniu/Desktop/sms.csv')
X_train_raw, X_test_raw, y_train, y_test = train_test_split(df['message'], df['label'])
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(X_train_raw)
X_test = vectorizer.transform(X_test_raw)
classifier = LogisticRegression()
classifier.fit(X_train, y_train)
# Note: this call raised an error when actually run, and the cause was not tracked down.
# One possible cause (an assumption): scoring='precision' expects the positive class to be
# labeled 1, so string labels such as 'spam'/'ham' would need to be converted to 0/1 first.
precisions = cross_val_score(classifier, X_train, y_train, cv=5, scoring='precision')
print('Precision', np.mean(precisions), precisions)
recalls = cross_val_score(classifier, X_train, y_train, cv=5, scoring='recall')
print('Recall', np.mean(recalls), recalls)
f1s = cross_val_score(classifier, X_train, y_train, cv=5, scoring='f1')
print('F1:', np.mean(f1s), f1s)
# Outcome:
Precision 0.989910506899 [ 0.98591549  1.          0.98850575  0.98795181  0.98717949]
Recall 0.685907046477 [ 0.60344828  0.69565217  0.74782609  0.71304348  0.66956522]
F1: 0.806840977066 [ 0.84102564  0.81675393  0.8042328   0.79144385  0.78074866]

1 . Precision = 0.9899 means that almost all of the messages the classifier predicted as spam were actually spam
2 . Recall = 0.686 means that it incorrectly classified approximately 31 percent of the spam messages as ham
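
To make the two definitions concrete, here is a minimal plain-Python sketch that counts true positives, false positives, and false negatives by hand. It reuses the ten toy labels from the confusion matrix example; the variable names are chosen for this illustration only.

```python
# Toy labels from the confusion matrix example (1 = positive, 0 = negative).
y_true = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 1, 0, 0, 0, 0, 0, 1, 1, 1]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # predicted positive, actually positive
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # predicted positive, actually negative
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # predicted negative, actually positive

precision = tp / (tp + fp)  # fraction of positive predictions that are correct
recall = tp / (tp + fn)     # fraction of actual positives that were found

print('TP=%d FP=%d FN=%d' % (tp, fp, fn))
print('Precision:', precision)
print('Recall:', recall)
```

With these labels there are 3 true positives, 1 false positive, and 2 false negatives, giving a precision of 0.75 and a recall of 0.6.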

Calculating the F1 measure
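
The F1 measure is the harmonic mean of precision and recall, F1 = 2PR / (P + R). The sketch below is a plain-Python illustration; `f1_measure` is a helper written for this example (scikit-learn provides `f1_score` for real use), and the input values are made up.

```python
def f1_measure(precision, recall):
    """Harmonic mean of precision and recall; defined as 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Made-up values: a precise classifier that misses many positives.
balanced = f1_measure(0.75, 0.6)
lopsided = f1_measure(1.0, 0.1)
arithmetic_mean = (1.0 + 0.1) / 2

print('F1 for P=0.75, R=0.6:', round(balanced, 4))
print('F1 for P=1.0, R=0.1:', round(lopsided, 4))
print('Arithmetic mean for P=1.0, R=0.1:', arithmetic_mean)
```

The harmonic mean stays close to the smaller of the two values, so a classifier with perfect precision but very low recall still gets a low F1, whereas the arithmetic mean would make it look mediocre rather than poor.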

ROC AUC

  • Unlike accuracy, the ROC curve is insensitive to data sets with unbalanced class proportions
  • ROC curves plot the classifier's recall against its fall-out
  • Fall-out, or the false positive rate, is the number of false positives divided by the total number of negatives: F = FP / (TN + FP)
  • AUC (area under the curve) is the area under the ROC curve; it reduces the curve to a single value that represents the expected performance of the classifier
  • Plot the ROC curve for the SMS spam classifier
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import roc_curve, auc

df = pd.read_csv('/Users/enniu/Desktop/sms.csv')
X_train_raw, X_test_raw, y_train, y_test = train_test_split(df['message'], df['label'])
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(X_train_raw)
X_test = vectorizer.transform(X_test_raw)
classifier = LogisticRegression()
classifier.fit(X_train, y_train)
predictions = classifier.predict_proba(X_test)
# Compare y_test with the predicted probability of the positive class (column 1)
false_positive_rate, recall, thresholds = roc_curve(y_test, predictions[:, 1])
roc_auc = auc(false_positive_rate, recall)  # compute the AUC value
plt.title('Receiver Operating Characteristic')
plt.plot(false_positive_rate, recall, 'b', label='AUC = %0.2f' % roc_auc)  # 'b' draws a blue line
plt.legend(loc='lower right')
plt.plot([0, 1], [0, 1], 'r--')  # diagonal reference line for a random classifier
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.ylabel('Recall')
plt.xlabel('Fall-out')
plt.show()
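
To see what the recall and fall-out pairs behind such a curve look like, the following hand-rolled sketch sweeps the decision threshold over four made-up scores and integrates the result with the trapezoidal rule. It is an illustration only, not how scikit-learn's `roc_curve` and `auc` are implemented; the helpers `roc_points` and `area_under_curve` are hypothetical, and the sketch assumes all scores are distinct.

```python
def roc_points(y_true, scores):
    """Return (fall_out, recall) pairs, from the strictest threshold to the loosest.

    Assumes binary labels (1 = positive) and distinct scores.
    """
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    tp = fp = 0
    # Lower the threshold one instance at a time, highest score first.
    for _, label in sorted(zip(scores, y_true), reverse=True):
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Trapezoidal area under a curve given as (x, y) points sorted by x."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Made-up labels and predicted probabilities of the positive class.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

points = roc_points(y_true, scores)
roc_auc = area_under_curve(points)
print(points)
print('AUC:', roc_auc)
```

Each step to the right is a false positive and each step up is a true positive, so a curve that climbs before it moves right encloses more area.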

Tuning models with grid search

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
# sklearn.grid_search and sklearn.cross_validation were merged into sklearn.model_selection
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.metrics import precision_score, recall_score, accuracy_score

pipeline = Pipeline([
    ('vect', TfidfVectorizer(stop_words='english')),
    # The liblinear solver supports both the 'l1' and 'l2' penalties searched below
    ('clf', LogisticRegression(solver='liblinear'))
])
# Keys use the '<step name>__<parameter name>' convention to address pipeline steps
parameters = {
    'vect__max_df': (0.25, 0.5, 0.75),
    'vect__stop_words': ('english', None),
    'vect__max_features': (2500, 5000, 10000, None),
    'vect__ngram_range': ((1, 1), (1, 2)),
    'vect__use_idf': (True, False),
    'vect__norm': ('l1', 'l2'),
    'clf__penalty': ('l1', 'l2'),
    'clf__C': (0.01, 0.1, 1, 10),
}
if __name__ == "__main__":
    grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1, scoring='accuracy', cv=3)
    df = pd.read_csv('/Users/enniu/Desktop/sms.csv')
    X, y = df['message'], df['label']
    X_train, X_test, y_train, y_test = train_test_split(X, y)
    grid_search.fit(X_train, y_train)
    print('Best score: %0.3f' % grid_search.best_score_)
    print('Best parameters set:')
    best_parameters = grid_search.best_estimator_.get_params()
    for param_name in sorted(parameters.keys()):
        print('\t%s: %r' % (param_name, best_parameters[param_name]))
    predictions = grid_search.predict(X_test)
    print('Accuracy:', accuracy_score(y_test, predictions))
    print('Precision:', precision_score(y_test, predictions))
    print('Recall:', recall_score(y_test, predictions))
# The following is the output of the script:
Fitting 3 folds for each of 1536 candidates, totalling 4608 fits
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:    4.7s
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:   23.8s
[Parallel(n_jobs=-1)]: Done 442 tasks      | elapsed:   52.3s
[Parallel(n_jobs=-1)]: Done 792 tasks      | elapsed:  1.6min
[Parallel(n_jobs=-1)]: Done 1242 tasks      | elapsed:  2.5min
[Parallel(n_jobs=-1)]: Done 1792 tasks      | elapsed:  3.7min
[Parallel(n_jobs=-1)]: Done 2442 tasks      | elapsed:  5.1min
[Parallel(n_jobs=-1)]: Done 3192 tasks      | elapsed:  6.8min
[Parallel(n_jobs=-1)]: Done 4042 tasks      | elapsed: 11.2min
[Parallel(n_jobs=-1)]: Done 4608 out of 4608 | elapsed: 12.4min finished
Best score: 0.985
Best parameters set:
    clf__C: 10
    clf__penalty: 'l2'
    vect__max_df: 0.25
    vect__max_features: 2500
    vect__ngram_range: (1, 2)
    vect__norm: 'l2'
    vect__stop_words: None
    vect__use_idf: True
Accuracy: 0.98493543759
Precision: 0.983333333333
Recall: 0.907692307692

Multi-class classification

  • One-vs.-all classification uses one binary classifier for each of the possible classes. The class that is predicted with the greatest confidence is assigned to the instance
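
The final selection step of one-vs.-all classification can be sketched in plain Python. The class names and confidence values below are invented for illustration, and `one_vs_all_predict` is a hypothetical helper; in practice each confidence would come from a separate binary classifier trained to separate one class from all the others.

```python
def one_vs_all_predict(class_confidences):
    """Return the class whose binary classifier reported the greatest confidence."""
    return max(class_confidences, key=class_confidences.get)


# Invented confidences from three hypothetical binary classifiers,
# each answering "is this instance in my class or not?".
class_confidences = {
    'class_a': 0.20,
    'class_b': 0.70,
    'class_c': 0.40,
}

predicted = one_vs_all_predict(class_confidences)
print('Predicted class:', predicted)
```

The confidences do not need to sum to 1, because each binary classifier is trained independently of the others.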