Binary classification with logistic regression
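- Probability distribution
- The response value represents a probability in the range [0, 1]
1 . Ordinary linear regression assumes that the response variable is normally distributed; the normal distribution is also called the Gaussian distribution or bell curve
2 . If the response variable is not normally distributed but instead represents the probability of an event, this assumption does not hold
3 . Generalized linear models use a link function to describe the relationship between the explanatory variables and the response variable
4 . Ordinary linear regression is a special case of the generalized linear model: it uses the identity link function, relating a linear combination of the explanatory variables to a normally distributed response variable
5 . In logistic regression, if the response variable exceeds some threshold, the prediction is positive; otherwise it is negative
6 . The response variable is modeled as a function of a linear combination of the explanatory variables using the logistic function. The logistic function returns a value between 0 and 1
7 . For the logistic function F(t) = 1 / (1 + e^(-t)), t is equal to a linear combination of the explanatory variables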
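The points above can be sketched in a few lines of plain Python. This is a minimal illustration, not scikit-learn's implementation: the names `logistic` and `predict_positive`, the weights, the intercept, and the 0.5 threshold are all assumptions chosen for the example.

```python
import math

def logistic(t):
    """Logistic function F(t) = 1 / (1 + e^(-t)); maps any real t into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-t))

def predict_positive(features, weights, intercept, threshold=0.5):
    """Return (is_positive, probability) for one instance.

    t is the linear combination of the explanatory variables; the
    prediction is positive when F(t) exceeds the threshold.
    """
    t = intercept + sum(w * x for w, x in zip(weights, features))
    probability = logistic(t)
    return probability > threshold, probability

# Hypothetical weights and feature values, chosen only for illustration
label, p = predict_positive(features=[2.0, 1.0], weights=[1.5, -0.5], intercept=-1.0)
print(label, round(p, 3))  # True 0.818
```

Here t = -1.0 + 1.5 * 2.0 - 0.5 * 1.0 = 1.5, so the predicted probability is about 0.818 and the instance is classified as positive.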
Spam filtering (SMS spam)
1 . explore the data and calculate some basic summary statistics using pandas
import pandas as pd
# Tab-separated file with no header row: column 0 is the label, column 1 is the message
df = pd.read_table('/Users/enniu/Desktop/SMSSpamCollection', delimiter='\t', header=None)
print('Number of spam messages:', df[df[0] == 'spam'][0].count())
print('Number of ham messages:', df[df[0] == 'ham'][0].count())
2 . create a TfidfVectorizer, then fit it with training messages, and transform both the training and test messages
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split  # replaces the removed sklearn.cross_validation
df = pd.read_table('/Users/enniu/Desktop/SMSSpamCollection', delimiter='\t', header=None)
# By default 25% of the rows go to the test set; each piece is a pandas Series
X_train_raw, X_test_raw, y_train, y_test = train_test_split(df[1], df[0])
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(X_train_raw)  # learn the vocabulary and build the TF-IDF matrix
X_test = vectorizer.transform(X_test_raw)  # SciPy sparse matrix that reuses the training vocabulary
3 . create an instance of LogisticRegression and train the model
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
df = pd.read_table('/Users/enniu/Desktop/SMSSpamCollection', delimiter='\t', header=None)
# By default 25% of the rows go to the test set; each piece is a pandas Series
X_train_raw, X_test_raw, y_train, y_test = train_test_split(df[1], df[0])
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(X_train_raw)  # learn the vocabulary and build the TF-IDF matrix
X_test = vectorizer.transform(X_test_raw)  # SciPy sparse matrix that reuses the training vocabulary
classifier = LogisticRegression()
classifier.fit(X_train, y_train)
predictions = classifier.predict(X_test)
for i, prediction in enumerate(predictions[:5]):
    print('Prediction:%s. Truelabel:%s. Message:%s' % (prediction, y_test.iloc[i], X_test_raw.iloc[i]))
# iloc (position-based indexing) is required here. X_test_raw[i] can raise a KeyError because the train/test split shuffles the rows, so the integer index labels of the test set no longer run 0, 1, 2, ...
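Step 2 fits the vectorizer on the training messages only and then transforms both sets. The sketch below illustrates why, using a deliberately simplified stand-in: `ToyCountVectorizer` is a hypothetical class that stores raw term counts with no IDF weighting, and the messages are made up. It is not how `TfidfVectorizer` is implemented; it only mirrors the fit/transform split.

```python
from collections import Counter

class ToyCountVectorizer:
    """Simplified stand-in for TfidfVectorizer: raw term counts, no IDF weighting."""

    def fit(self, messages):
        # Learn the vocabulary from the training messages only
        words = sorted({w for m in messages for w in m.lower().split()})
        self.vocabulary_ = {w: i for i, w in enumerate(words)}
        return self

    def transform(self, messages):
        # Map each message onto the learned vocabulary; unseen words are dropped
        rows = []
        for m in messages:
            counts = Counter(m.lower().split())
            row = [0] * len(self.vocabulary_)
            for word, count in counts.items():
                if word in self.vocabulary_:
                    row[self.vocabulary_[word]] = count
            rows.append(row)
        return rows

train = ["win cash now", "see you at lunch"]  # hypothetical training messages
test = ["win a free lunch now"]               # hypothetical test message

vec = ToyCountVectorizer().fit(train)
X_train = vec.transform(train)
X_test = vec.transform(test)
print(sorted(vec.vocabulary_))  # ['at', 'cash', 'lunch', 'now', 'see', 'win', 'you']
print(X_test)                   # [[0, 0, 1, 1, 0, 1, 0]]
```

Because the vocabulary comes from the training set alone, the training and test matrices share the same columns, and words that appear only in the test message ("a", "free") contribute nothing.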
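Binary classification performance metrics
| | Predicted positive | Predicted negative |
|---|---|---|
| Actual positive | True Positive | False Negative |
| Actual negative | False Positive | True Negative |
In scikit-learn's actual output the layout is as follows, with the positive class (1) in the bottom row
| | Predicted 0 | Predicted 1 |
|---|---|---|
| Actual 0 | True Negative | False Positive |
| Actual 1 | False Negative | True Positive |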
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
y_test = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 1, 0, 0, 0, 0, 0, 1, 1, 1]
# Use a separate name so the imported confusion_matrix function is not shadowed
cm = confusion_matrix(y_test, y_pred)
print(cm)  # rows are true labels, columns are predicted labels
plt.matshow(cm)
plt.title('Confusion matrix')
plt.colorbar()
plt.ylabel('True label')
plt.xlabel('Predicted label')
plt.show()
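To check the matrix layout by hand, the counts can be tallied directly from the two label lists used above. This is a plain-Python sketch; `confusion_counts` is a hypothetical helper that assumes the labels are exactly 0 and 1.

```python
def confusion_counts(y_true, y_pred):
    """Return [[TN, FP], [FN, TP]] for binary labels 0 and 1."""
    matrix = [[0, 0], [0, 0]]
    for actual, predicted in zip(y_true, y_pred):
        matrix[actual][predicted] += 1  # row = true label, column = predicted label
    return matrix

y_test = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 1, 0, 0, 0, 0, 0, 1, 1, 1]
print(confusion_counts(y_test, y_pred))  # [[4, 1], [2, 3]]
```

The result reads as 4 true negatives, 1 false positive, 2 false negatives and 3 true positives.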
Accuracy
- Accuracy measures the fraction of the classifier's predictions that are correct
from sklearn.metrics import accuracy_score
y_pred = [0, 1, 1, 0]
y_true = [1, 1, 1, 1]
print('Accuracy:', accuracy_score(y_true, y_pred))  # outcome is 0.5
- evaluate the classifier's accuracy
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
df = pd.read_csv('/Users/enniu/Desktop/sms.csv')
X_train_raw, X_test_raw, y_train, y_test = train_test_split(df['message'], df['label'])
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(X_train_raw)
X_test = vectorizer.transform(X_test_raw)
classifier = LogisticRegression()
classifier.fit(X_train, y_train)
# 5-fold cross-validation on the training set; the default score for a classifier is accuracy
scores = cross_val_score(classifier, X_train, y_train, cv=5)
# y_pre = classifier.predict(X_test)
# for i, pre in enumerate(y_pre[:5]):
#     print(y_pre[i], y_test.iloc[i], X_test_raw.iloc[i])
print('Accuracy', np.mean(scores), scores)
# Outcome: Accuracy 0.955980861244 [ 0.94976077 0.95933014 0.96052632 0.96291866 0.94736842]
- Drawbacks
1 . Accuracy can't distinguish between false positive errors and false negative errors
2 . Accuracy is not an informative metric if the class proportions are skewed in the population
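The second drawback is easy to see with a made-up example. The sketch below assumes a hypothetical population of 95 ham and 5 spam messages and a "classifier" that always predicts ham; the numbers are illustrative, not taken from the SMS data set.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# Hypothetical skewed population: 95 ham (0) and 5 spam (1)
y_true = [0] * 95 + [1] * 5
always_ham = [0] * 100  # never predicts spam

print(accuracy(y_true, always_ham))  # 0.95

# Number of spam messages the classifier actually caught
spam_caught = sum(1 for t, p in zip(y_true, always_ham) if t == 1 and p == 1)
print(spam_caught)  # 0
```

An accuracy of 0.95 looks strong even though not a single spam message is detected, which is why the metrics in the next sections are needed.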
Precision and recall
- definition
  - Precision is the fraction of positive predictions that are correct: TP / (TP + FP)
  - Recall is the fraction of truly positive instances that the classifier recognizes: TP / (TP + FN)
- calculate the SMS classifier's precision and recall
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
df = pd.read_csv('/Users/enniu/Desktop/sms.csv')
X_train_raw, X_test_raw, y_train, y_test = train_test_split(df['message'], df['label'])
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(X_train_raw)
X_test = vectorizer.transform(X_test_raw)
classifier = LogisticRegression()
classifier.fit(X_train, y_train)
# The author hit an unexplained error on the next line. One likely cause (an assumption): the 'precision', 'recall' and 'f1' scorers treat label 1 as the positive class, so string labels such as 'spam'/'ham' must be mapped to 1/0 first
precisions = cross_val_score(classifier, X_train, y_train, cv=5, scoring='precision')
print('Precision', np.mean(precisions), precisions)
recalls = cross_val_score(classifier, X_train, y_train, cv=5, scoring='recall')
print('Recall', np.mean(recalls), recalls)
f1s = cross_val_score(classifier, X_train, y_train, cv=5, scoring='f1')
print('F1:', np.mean(f1s), f1s)
# Outcome:
# Precision 0.989910506899 [ 0.98591549 1. 0.98850575 0.98795181 0.98717949]
# Recall 0.685907046477 [ 0.60344828 0.69565217 0.74782609 0.71304348 0.66956522]
# F1: 0.806840977066 [ 0.84102564 0.81675393 0.8042328 0.79144385 0.78074866]
1 . Precision=0.9899 means that almost all of the messages it predicted as spam were actually spam
2 . Recall=0.686 means that it incorrectly classified approximately 31 percent of the spam messages as ham
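The same pattern of high precision and low recall can be reproduced on a tiny hand-made example. This sketch computes both metrics from true-positive, false-positive and false-negative counts; the helper `precision_recall` and the label lists are hypothetical and unrelated to the SMS results above.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Return (precision, recall) for the given positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical labels: 4 spam (1) and 6 ham (0); the classifier flags only 2 messages
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

precision, recall = precision_recall(y_true, y_pred)
print(precision, recall)  # 1.0 0.5
```

Every message flagged as spam really is spam (precision 1.0), yet half of the spam slips through as ham (recall 0.5).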
Calculating the F1 measure
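The F1 measure is the harmonic mean of precision and recall, F1 = 2 * P * R / (P + R). A minimal sketch follows; `f1_measure` is a hypothetical helper, and the input values are illustrative rather than the cross-validated scores reported above.

```python
def f1_measure(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Illustrative values: perfect precision but only half of the positives found
f1 = f1_measure(1.0, 0.5)
arithmetic_mean = (1.0 + 0.5) / 2

print(round(f1, 3))      # 0.667
print(arithmetic_mean)   # 0.75
```

The harmonic mean sits below the arithmetic mean whenever precision and recall differ, so F1 penalizes a classifier that does well on one and poorly on the other.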
ROC AUC
- Unlike accuracy, the ROC curve is insensitive to data sets with unbalanced class proportions
- ROC curves plot the classifier's recall against its fall-out
- Fall-out, or the false positive rate, is the number of false positives divided by the total number of negatives
- AUC (area under the curve) is the area under the ROC curve, which represents the expected performance of the classifier
- plot the ROC curve for the SMS spam classifier
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import roc_curve, auc
df=pd.read_csv('/Users/enniu/Desktop/sms.csv')
X_train_raw, X_test_raw, y_train, y_test = train_test_split(df['message'], df['label'])
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(X_train_raw)
X_test = vectorizer.transform(X_test_raw)
classifier = LogisticRegression()
classifier.fit(X_train, y_train)
predictions = classifier.predict_proba(X_test)
false_positive_rate, recall, thresholds = roc_curve(y_test, predictions[:, 1])  # compare y_test with the predicted probability of the positive class
roc_auc = auc(false_positive_rate, recall)  # compute the AUC value
plt.title('Receiver Operating Characteristic')
plt.plot(false_positive_rate, recall, 'b', label='AUC = %0.2f' % roc_auc)  # 'b' draws a blue line
plt.legend(loc='lower right')
plt.plot([0, 1], [0, 1], 'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.ylabel('Recall')
plt.xlabel('Fall-out')
plt.show()
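What `roc_curve` and `auc` compute can be sketched by hand on four made-up scores. The helpers `roc_points` and `area_under_curve` are hypothetical, assume labels 0 and 1 with at least one of each, and use the trapezoidal rule; they illustrate the idea and are not scikit-learn's implementation.

```python
def roc_points(y_true, scores):
    """Return (fall_out, recall) pairs, one per threshold, strictest first."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for threshold in sorted(set(scores), reverse=True):
        predicted = [1 if s >= threshold else 0 for s in scores]
        tp = sum(1 for t, p in zip(y_true, predicted) if t == 1 and p == 1)
        fp = sum(1 for t, p in zip(y_true, predicted) if t == 0 and p == 1)
        points.append((fp / negatives, tp / positives))
    return points

def area_under_curve(points):
    """Trapezoidal rule over points ordered by increasing fall-out."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Hypothetical true labels and predicted probabilities of the positive class
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

points = roc_points(y_true, scores)
print(points)                    # [(0.0, 0.0), (0.0, 0.5), (0.5, 0.5), (0.5, 1.0), (1.0, 1.0)]
print(area_under_curve(points))  # 0.75
```

Lowering the threshold moves the curve up (more recall) and to the right (more fall-out); the red dashed diagonal in the plot corresponds to an AUC of 0.5.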
Tuning models with grid search
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.metrics import precision_score, recall_score, accuracy_score
pipeline = Pipeline([
    ('vect', TfidfVectorizer(stop_words='english')),
    # liblinear supports both the 'l1' and 'l2' penalties searched below
    ('clf', LogisticRegression(solver='liblinear'))
])
# Keys use the form <step name>__<parameter name>
parameters = {
    'vect__max_df': (0.25, 0.5, 0.75),
    'vect__stop_words': ('english', None),
    'vect__max_features': (2500, 5000, 10000, None),
    'vect__ngram_range': ((1, 1), (1, 2)),
    'vect__use_idf': (True, False),
    'vect__norm': ('l1', 'l2'),
    'clf__penalty': ('l1', 'l2'),
    'clf__C': (0.01, 0.1, 1, 10),
}
if __name__ == "__main__":
    grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1, scoring='accuracy', cv=3)
    df = pd.read_csv('/Users/enniu/Desktop/sms.csv')
    X, y = df['message'], df['label']
    X_train, X_test, y_train, y_test = train_test_split(X, y)
    grid_search.fit(X_train, y_train)
    print('Best score: %0.3f' % grid_search.best_score_)
    print('Best parameters set:')
    best_parameters = grid_search.best_estimator_.get_params()
    for param_name in sorted(parameters.keys()):
        print('\t%s: %r' % (param_name, best_parameters[param_name]))
    # predict() uses the best estimator, refit on the whole training set
    predictions = grid_search.predict(X_test)
    print('Accuracy:', accuracy_score(y_test, predictions))
    print('Precision:', precision_score(y_test, predictions))
    print('Recall:', recall_score(y_test, predictions))
# The following is the output of the script:
Fitting 3 folds for each of 1536 candidates, totalling 4608 fits
[Parallel(n_jobs=-1)]: Done 42 tasks | elapsed: 4.7s
[Parallel(n_jobs=-1)]: Done 192 tasks | elapsed: 23.8s
[Parallel(n_jobs=-1)]: Done 442 tasks | elapsed: 52.3s
[Parallel(n_jobs=-1)]: Done 792 tasks | elapsed: 1.6min
[Parallel(n_jobs=-1)]: Done 1242 tasks | elapsed: 2.5min
[Parallel(n_jobs=-1)]: Done 1792 tasks | elapsed: 3.7min
[Parallel(n_jobs=-1)]: Done 2442 tasks | elapsed: 5.1min
[Parallel(n_jobs=-1)]: Done 3192 tasks | elapsed: 6.8min
[Parallel(n_jobs=-1)]: Done 4042 tasks | elapsed: 11.2min
[Parallel(n_jobs=-1)]: Done 4608 out of 4608 | elapsed: 12.4min finished
Best score: 0.985
Best parameters set:
clf__C: 10
clf__penalty: 'l2'
vect__max_df: 0.25
vect__max_features: 2500
vect__ngram_range: (1, 2)
vect__norm: 'l2'
vect__stop_words: None
vect__use_idf: True
Accuracy: 0.98493543759
Precision: 0.983333333333
Recall: 0.907692307692
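The "1536 candidates, totalling 4608 fits" line in the output follows directly from the parameter grid: grid search tries every combination of the listed values and fits each one once per fold. The sketch below enumerates the combinations with `itertools.product`; it only reproduces the counting and is not how `GridSearchCV` is implemented internally.

```python
from itertools import product

# Same grid as in the script above
parameters = {
    'vect__max_df': (0.25, 0.5, 0.75),
    'vect__stop_words': ('english', None),
    'vect__max_features': (2500, 5000, 10000, None),
    'vect__ngram_range': ((1, 1), (1, 2)),
    'vect__use_idf': (True, False),
    'vect__norm': ('l1', 'l2'),
    'clf__penalty': ('l1', 'l2'),
    'clf__C': (0.01, 0.1, 1, 10),
}

names = sorted(parameters)
# One dict per candidate: the Cartesian product of all value lists
candidates = [dict(zip(names, values))
              for values in product(*(parameters[name] for name in names))]

cv_folds = 3
print(len(candidates))             # 1536
print(len(candidates) * cv_folds)  # 4608
```

Since 3 * 2 * 4 * 2 * 2 * 2 * 2 * 4 = 1536, adding one more two-valued parameter would double the running time, which is worth keeping in mind when widening the grid.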
Multi-class classification
- One-vs.-all classification uses one binary classifier for each of the possible classes. The class that is predicted with the greatest confidence is assigned to the instance
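The one-vs.-all rule can be sketched as follows: each class gets its own binary logistic model, and the class whose model is most confident wins. The class names, weights and intercepts below are hypothetical values chosen for illustration, and `one_vs_all_predict` is not a scikit-learn function.

```python
import math

def logistic(t):
    """Logistic function, mapping a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-t))

def one_vs_all_predict(features, classifiers):
    """Pick the class whose binary classifier reports the greatest confidence.

    classifiers maps a class name to the (weights, intercept) of its binary model.
    """
    confidences = {}
    for label, (weights, intercept) in classifiers.items():
        t = intercept + sum(w * x for w, x in zip(weights, features))
        confidences[label] = logistic(t)
    best = max(confidences, key=confidences.get)
    return best, confidences

# Hypothetical per-class binary models for three classes
classifiers = {
    'A': ([2.0, -1.0], -0.5),
    'B': ([-1.0, 1.5], 0.0),
    'C': ([0.5, 0.5], -1.0),
}

best, confidences = one_vs_all_predict([1.0, 0.2], classifiers)
print(best)  # A
print({label: round(c, 3) for label, c in confidences.items()})
```

The confidences need not sum to 1 because each binary model is trained independently; only their ranking decides the assigned class.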