[Python] Machine Learning Notes: Classification and Prediction with Logistic Regression

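References:
Zhihu: Derivation of the logistic regression formulas
Zhihu: What is the difference between logistic regression and SVM? Which problems is each suited to?
Zhihu: Why does LR use the sigmoid function? What are its advantages and disadvantages? Why not use another function?
Wiki: Logistic Regression
Zhihu: Why does the LR model use the sigmoid function, and what is the mathematical principle behind it?
Jianshu: Feature normalization and discretization for LR models
Alibaba Cloud Developer Community: AI projects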

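Guiding questions

What logistic regression is (Section I), the derivation of logistic regression (Section II.3), and the derivation of the loss function (Section II.4)
Similarities and differences between logistic regression and SVM
Both logistic regression and SVM are used for classification, and both build on a regression-style linear function of the features.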

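SVM considers only the support vectors, that is, the few points most relevant to the classification, when learning the classifier.
Logistic regression applies a nonlinear mapping that greatly reduces the weight of points far from the decision plane and so relatively raises the weight of the points most relevant to the classification. The fundamental goal of the two methods is the same.
SVM focuses on the points near the margin of the hyperplane and so works locally (on the support vectors), whereas logistic regression takes every point into account and works globally.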
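The local-versus-global contrast can be seen on a tiny dataset. The following is only an illustrative sketch: it assumes scikit-learn is installed, reuses the six toy points from the demo section of these notes, and the choice of `SVC(kernel="linear")` is an assumption made for the illustration, not something the source prescribes. After fitting, the SVM exposes the indices of the points it kept as support vectors, while logistic regression simply assigns a probability to every point.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Six toy points (same values as the demo dataset later in these notes)
X = np.array([[-1, -2], [-2, -1], [-3, -2], [1, 3], [2, 1], [3, 2]])
y = np.array([0, 0, 0, 1, 1, 1])

svm = SVC(kernel="linear").fit(X, y)   # linear SVM (an assumption for this sketch)
lr = LogisticRegression().fit(X, y)

# Only a subset of the training points defines the SVM boundary
print("support vector indices:", svm.support_)
print("number of support vectors:", len(svm.support_), "of", len(X))

# Logistic regression produces a probability for every training point
print("LR P(y=1):", lr.predict_proba(X)[:, 1].round(3))
```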
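Differences between logistic regression and linear regression
The output of linear regression is a numeric value rather than a label, so it cannot solve a binary classification problem directly;
Logistic regression builds on linear regression, uses the sigmoid function to obtain a probability, and solves the binary classification problem by thresholding that probability.
Why LR needs normalization or a log transform, and why LR works better after features are discretized
Normalization speeds up convergence and improves the precision of the solution the optimizer converges to.
Feature discretization has the following advantages:
(1) Logistic regression is a generalized linear model with limited expressive power. After a single variable is discretized into N values, each value gets its own weight, which effectively introduces nonlinearity into the model, increases its expressive power, and improves the fit;
(2) Discretized features can be crossed, turning M+N variables into M*N variables, which introduces further nonlinearity and expressive power;
Discretization also simplifies the logistic regression model and reduces the risk of overfitting.
(3) Discrete features are easy to add and remove, which makes fast model iteration easier;
(4) Inner products of sparse vectors are fast to compute, and the results are easy to store and easy to scale;
(5) Discretized features are very robust to outliers. For example, take a feature that is 1 if age > 30 and 0 otherwise. Without discretization, an abnormal record such as "age 300" would disturb the model badly;
(6) The model is more stable after discretization. For example, if user age is discretized with 20-30 as one interval, a user does not become a completely different person just because they are one year older.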
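Points (5) and (6) can be made concrete with a few lines of NumPy. This is a minimal sketch: the ages and the bucket edges are hypothetical values chosen for illustration, and `np.digitize` plus an identity-matrix lookup is just one simple way to build the one-hot columns.

```python
import numpy as np

ages = np.array([22, 29, 30, 31, 300])   # 300 is an obviously abnormal record

# Binary feature from the text: 1 if age > 30, otherwise 0
over_30 = (ages > 30).astype(int)

# Bucketed feature; the edges are hypothetical
edges = np.array([20, 30, 40])
bucket = np.digitize(ages, edges)        # index of the interval each age falls into

# One-hot encoding: each bucket becomes its own column with its own weight
one_hot = np.eye(len(edges) + 1, dtype=int)[bucket]

print(over_30)   # the outlier 300 gets the same value as 31
print(bucket)    # 22 and 29 share a bucket, so one more year changes nothing
print(one_hot)
```

The raw value 300 would dominate a linear score, while after discretization it is just another member of the last bucket.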
Why LR uses the sigmoid function, what its advantages and disadvantages are, and why other functions are not used

The model of logistic regression, however, is based on quite different assumptions (about the relationship between dependent and independent variables) from those of linear regression. In particular the key differences between these two models can be seen in the following two features of logistic regression. First, the conditional distribution $y \mid x$ is a Bernoulli distribution rather than a Gaussian distribution, because the dependent variable is binary. Second, the predicted values are probabilities and are therefore restricted to (0,1) through the logistic distribution function because logistic regression predicts the probability of particular outcomes.
(Source: Wikipedia, "Logistic regression")

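First, we are modeling Y|X and assume that Y|X follows a Bernoulli distribution, so all we need to know is P(Y|X). Second, we want a linear model, so P(Y|X) = f(wx). What remains is to determine f. The f obtained from the maximum entropy principle is the sigmoid.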

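I. Introduction

Logistic regression (LR) is a classification model used mainly for binary classification (the output takes only two values, each representing one class), and it is widely applied across many fields.

Strengths and weaknesses of the logistic regression model:

Advantages: simple to implement and easy to understand; low computational cost, fast, and light on storage.

Disadvantages: prone to underfitting, and classification accuracy may be limited.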

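The output of linear regression is a numeric value rather than a label, so it clearly cannot solve a binary classification problem directly.

The most intuitive approach is to set a threshold, for example 0: if the predicted value y > 0 the sample gets label A, otherwise label B. A model built this way is called a perceptron.
Another approach is not to predict the label directly but to predict the probability that the label is A. A probability is a continuous value in the interval [0,1], so the output value is the probability of label A. Typically, if the probability of label A is greater than 0.5 the sample is assigned to class A, otherwise to class B. This is the logistic regression model.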
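The two decision rules can be compared side by side. The sketch below uses hypothetical linear-model scores and assumes the sigmoid function (introduced in Section II) as the mapping from score to probability. Both rules pick the same label here, but the probabilistic rule also says how confident the model is.

```python
import math

def sigmoid(z):
    """Logistic function: maps a real-valued score to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

scores = [-2.0, -0.1, 0.3, 4.0]   # hypothetical linear-model outputs

# Perceptron-style rule: threshold the raw score at 0
hard_labels = ["A" if z > 0 else "B" for z in scores]

# Logistic-regression-style rule: threshold the probability at 0.5
probs = [sigmoid(z) for z in scores]
soft_labels = ["A" if p > 0.5 else "B" for p in probs]

for z, p, label in zip(scores, probs, soft_labels):
    print(f"score={z:+.1f}  P(A)={p:.3f}  label={label}")
```

A score of 0.3 and a score of 4.0 both receive label A, yet their probabilities differ considerably, which is information the hard threshold discards.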

II. Principles and Derivations

1. The sigmoid function

The logistic function (also called the sigmoid function) has the form:

$$y = g(z) = \frac{1}{1 + e^{-z}}$$

Its graph is:

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
x = np.arange(-5,5,0.01)
y = 1/(1+np.exp(-x))

plt.plot(x,y)
plt.xlabel('z')
plt.ylabel('y')
plt.grid()
plt.show()

The output of the function lies in the open interval (0,1): it approaches 0 and 1 but never reaches them.
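A quick numerical check of these properties, as a small sketch over the same range of z used in the plot (the helper name `sigmoid` is introduced here only for illustration):

```python
import numpy as np

def sigmoid(z):
    """Logistic function 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

z = np.arange(-5, 5, 0.01)
y = sigmoid(z)

print(y.min(), y.max())                  # both strictly between 0 and 1
print(sigmoid(0.0))                      # 0.5, the usual decision threshold
print(np.allclose(sigmoid(-z), 1 - y))   # symmetry: g(-z) = 1 - g(z)
```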

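2. The linear regression model

The linear regression model is:

$$y = w^T x + b$$

For a given input x, linear regression outputs a numeric value y, so it is a model for regression problems.

To eliminate the trailing constant term b, we can let:

$$x' = \begin{bmatrix} 1 \\ x \end{bmatrix}, \qquad w' = \begin{bmatrix} b \\ w \end{bmatrix}$$

That is, x gets one extra component whose value is always 1, so that b moves into w, and the equation of the line simplifies to:

$$y = w'^T x'$$

From here on the primes are dropped and the model is written simply as $y = w^T x$.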
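The bias-absorption trick is easy to verify numerically. In this sketch the inputs, weights, and bias are hypothetical, and the constant 1 is placed in front of each x (placing it at the end would work equally well as long as b is placed in the matching position of w).

```python
import numpy as np

X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])   # hypothetical inputs, one row per sample
w = np.array([0.5, -1.0])                             # hypothetical weights
b = 2.0                                               # hypothetical bias

y_with_bias = X @ w + b

# Prepend a constant 1 to every x and prepend b to w
X_aug = np.c_[np.ones(len(X)), X]
w_aug = np.r_[b, w]
y_absorbed = X_aug @ w_aug

print(y_with_bias)
print(y_absorbed)   # same values: the bias now lives inside the weight vector
```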

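3. The logistic regression model

Combine the sigmoid function with the linear regression function by using the output of the linear regression model as the input of the sigmoid function.
In other words, substitute the regression model's prediction into the sigmoid function to obtain a probability, and from it the class.
The result is the logistic regression model:

$$y = \frac{1}{1 + e^{-w^T x}}$$

Suppose a set of weights has already been trained. Substituting the input we want to predict into the equation above gives an output y that is the probability of label A, from which we can decide which class the input belongs to. In essence, we use the data to solve for the specific w of the model, and thereby obtain a logistic regression model fitted to the features of the current data.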
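As a minimal sketch of prediction with already-trained parameters: the weights, bias, and inputs below are hypothetical, and `predict_proba_a` is a helper name invented for this illustration, not a library function.

```python
import numpy as np

def predict_proba_a(x, w, b):
    """P(label = A | x) for a logistic regression model with weights w and bias b."""
    return 1.0 / (1.0 + np.exp(-(np.dot(w, x) + b)))

w = np.array([1.0, -2.0])   # hypothetical trained weights
b = 0.5                     # hypothetical trained bias

for x in (np.array([3.0, 1.0]), np.array([0.0, 2.0])):
    p = predict_proba_a(x, w, b)
    label = "A" if p > 0.5 else "B"
    print(x, round(float(p), 3), label)
```

The first input has linear score 1.5 and is assigned to A; the second has score -3.5 and is assigned to B.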

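In principle, logistic regression implements a decision boundary.

After training, we obtain an n-dimensional weight vector w and a bias b. Each component of w indicates how much the corresponding feature contributes to the final classification result. If the component is positive, the feature contributes positively to the result, and the larger its value, the more important the feature is for a positive classification. The bias b reflects, to some extent, how easily a sample is judged positive or negative. If b is 0, the two classes are treated evenly. If b is greater than 0, samples are more easily classified as positive, and vice versa. The size of the weight on each feature therefore gives a clear, quantitative view of that feature's importance, which is why logistic regression is regarded as a highly interpretable model.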

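4. The loss function and its derivation

The loss function measures the difference between the model's output and the true output.

Suppose there are only two labels, 1 and 0. If we treat each collected sample as an event, let p be the probability of that event. The value output by our model equals the probability that the label is 1, namely p:

$$p = P(y = 1 \mid x) = \frac{1}{1 + e^{-w^T x}}$$

Treating a single sample as an event, the probability of that event is:

$$P(y \mid x) = \begin{cases} p, & y = 1 \\ 1 - p, & y = 0 \end{cases}$$

This is equivalent to (when y=1 the result is p; when y=0 the result is 1-p):

$$P(y \mid x) = p^{y} (1 - p)^{1 - y}$$

If we have collected a dataset of N samples, the total probability of the combined event is the product of the probabilities of the individual samples, that is, the probability of collecting exactly this set of samples (here $p_n$ is the model output for sample $x_n$):

$$P_{total} = \prod_{n=1}^{N} p_n^{y_n} (1 - p_n)^{1 - y_n}$$

Taking the logarithm of both sides gives:

$$F(w) = \ln P_{total} = \sum_{n=1}^{N} \left[ y_n \ln p_n + (1 - y_n) \ln (1 - p_n) \right]$$

This function F(w) is also called the loss function. Its value here is the logarithm of the total probability of the event, which we want to be as large as possible. That runs against the usual meaning of "loss", so a minus sign can be placed in front and $-F(w)$ minimized instead.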
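The negated log-likelihood can be evaluated directly. The sketch below assumes the bias has been absorbed into w by prepending a constant 1 to each sample, uses the six toy points from the demo section, and compares two hand-picked (hypothetical) weight vectors; it does not train anything.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def negative_log_likelihood(w, X, y):
    """-F(w): sum over samples of -[y*ln(p) + (1-y)*ln(1-p)], with p = sigmoid(X @ w)."""
    p = sigmoid(X @ w)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Toy dataset from the demo section, with a leading 1 so that w[0] plays the role of the bias
X = np.array([[-1, -2], [-2, -1], [-3, -2], [1, 3], [2, 1], [3, 2]], dtype=float)
X = np.c_[np.ones(len(X)), X]
y = np.array([0, 0, 0, 1, 1, 1])

w_zero = np.zeros(3)                  # predicts p = 0.5 for every sample
w_good = np.array([0.0, 1.0, 1.0])    # hypothetical weights that separate the classes

loss_zero = negative_log_likelihood(w_zero, X, y)
loss_good = negative_log_likelihood(w_good, X, y)
print(loss_zero)   # equals 6 * ln(2), since every sample has probability 0.5
print(loss_good)   # much smaller: the data are far more probable under w_good
```

Training amounts to searching for the w that makes this value as small as possible.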

III. Demo

Notebook magic commands:
Alibaba Cloud mirror: !pip install pyodps -i "https://mirrors.aliyun.com/pypi/simple/"
Inline matplotlib plots in Jupyter and similar environments: %matplotlib inline

Step 1: Import libraries

## Core numerical library
import numpy as np

## Plotting libraries
import matplotlib.pyplot as plt
import seaborn as sns

## Logistic regression model
from sklearn.linear_model import LogisticRegression

Step 2: Train the model

## Build the dataset
x_fearures = np.array([[-1, -2], [-2, -1], [-3, -2], [1, 3], [2, 1], [3, 2]])
y_label = np.array([0, 0, 0, 1, 1, 1])

## Create the logistic regression model
lr_clf = LogisticRegression()

## Fit the model to the constructed dataset
lr_clf = lr_clf.fit(x_fearures, y_label)  # linear part of the fitted model: w0 + w1*x1 + w2*x2

Step 3: Inspect the model parameters

## Weights w of the fitted model
print('the weight of Logistic Regression:',lr_clf.coef_)
## Intercept w0 of the fitted model
print('the intercept(w0) of Logistic Regression:',lr_clf.intercept_)
## the weight of Logistic Regression: [[0.73462087 0.6947908]]
## the intercept(w0) of Logistic Regression: [-0.03643213]
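With these parameters the decision boundary is the straight line w0 + w1*x1 + w2*x2 = 0, where the predicted probability is exactly 0.5. The sketch below copies the printed values above by hand (they may differ slightly across scikit-learn versions and solver settings); `boundary_x2` is a helper name made up for this illustration.

```python
import numpy as np

# Values copied from the printed output above
w1, w2 = 0.73462087, 0.6947908
w0 = -0.03643213

def boundary_x2(x1):
    """x2 on the line w0 + w1*x1 + w2*x2 = 0, i.e. where P(y=1) = 0.5."""
    return -(w0 + w1 * x1) / w2

x1 = np.array([-3.0, 0.0, 3.0])
x2 = boundary_x2(x1)

# Plug the boundary points back into the model
z = w0 + w1 * x1 + w2 * x2
p = 1.0 / (1.0 + np.exp(-z))

print(np.c_[x1, x2])   # points on the decision boundary
print(p)               # all approximately 0.5
```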

Step 4: Visualize the data and the model

## Visualize the constructed sample points
plt.figure()
plt.scatter(x_fearures[:,0],x_fearures[:,1], c=y_label, s=50, cmap='viridis')
plt.title('Dataset')
plt.show()

## Visualize the decision boundary
plt.figure()
plt.scatter(x_fearures[:,0],x_fearures[:,1], c=y_label, s=50, cmap='viridis')
plt.title('Dataset')

nx, ny = 200, 100
x_min, x_max = plt.xlim()
y_min, y_max = plt.ylim()
x_grid, y_grid = np.meshgrid(np.linspace(x_min, x_max, nx),np.linspace(y_min, y_max, ny))

## Predicted probability of class 1 at every grid point
z_proba = lr_clf.predict_proba(np.c_[x_grid.ravel(), y_grid.ravel()])
z_proba = z_proba[:, 1].reshape(x_grid.shape)
## Draw the contour where that probability equals 0.5
plt.contour(x_grid, y_grid, z_proba, [0.5], linewidths=2., colors='blue')

plt.show()

## Visualize predictions for new samples

plt.figure()
## new point 1
x_fearures_new1 = np.array([[0, -1]])
plt.scatter(x_fearures_new1[:,0],x_fearures_new1[:,1], s=50)
## The label is passed positionally: the keyword `s=` was removed in newer Matplotlib releases
plt.annotate('New point 1',xy=(0,-1),xytext=(-2,0),color='blue',arrowprops=dict(arrowstyle='-|>',connectionstyle='arc3',color='red'))

## new point 2
x_fearures_new2 = np.array([[1, 2]])
plt.scatter(x_fearures_new2[:,0],x_fearures_new2[:,1], s=50)
plt.annotate('New point 2',xy=(1,2),xytext=(-1.5,2.5),color='red',arrowprops=dict(arrowstyle='-|>',connectionstyle='arc3',color='red'))

## Training samples
plt.scatter(x_fearures[:,0],x_fearures[:,1], c=y_label, s=50, cmap='viridis')
plt.title('Dataset')

## Decision boundary (reuses x_grid, y_grid and z_proba computed for the previous figure)
plt.contour(x_grid, y_grid, z_proba, [0.5], linewidths=2., colors='blue')

plt.show()

Step 5: Model prediction

## Use the trained model to predict the class of the two new points
y_label_new1_predict=lr_clf.predict(x_fearures_new1)
y_label_new2_predict=lr_clf.predict(x_fearures_new2)
print('The New point 1 predict class:\n',y_label_new1_predict)
print('The New point 2 predict class:\n',y_label_new2_predict)
## Logistic regression is a probabilistic model (p = p(y=1|x,\theta), introduced earlier), so predict_proba can be used to get the class probabilities
y_label_new1_predict_proba=lr_clf.predict_proba(x_fearures_new1)
y_label_new2_predict_proba=lr_clf.predict_proba(x_fearures_new2)
print('The New point 1 predict Probability of each class:\n',y_label_new1_predict_proba)
print('The New point 2 predict Probability of each class:\n',y_label_new2_predict_proba)
## The New point 1 predict class:
## [0]
## The New point 2 predict class:
## [1]
## The New point 1 predict Probability of each class:
## [[0.69567724  0.30432276]]
## The New point 2 predict Probability of each class:
## [[0.11983936  0.88016064]]
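The relationship between the two kinds of output can be checked by hand. The sketch below copies the recorded probabilities above into arrays (no model is fitted) and shows that the predicted class is simply the column with the larger probability, which for two classes is the same as thresholding P(class 1) at 0.5.

```python
import numpy as np

# Probabilities copied from the recorded output above; columns are [P(class 0), P(class 1)]
proba_new1 = np.array([[0.69567724, 0.30432276]])
proba_new2 = np.array([[0.11983936, 0.88016064]])

for name, proba in (("New point 1", proba_new1), ("New point 2", proba_new2)):
    label = proba.argmax(axis=1)   # index of the most probable class
    print(name, "row sum:", proba.sum(axis=1), "predicted class:", label)
```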

IV. Logistic Regression Classification on the Iris Dataset

Step 1: Import libraries

## Core libraries
import numpy as np
import pandas as pd

## Plotting libraries
import matplotlib.pyplot as plt
import seaborn as sns

Step 2: Load the data

from sklearn.datasets import load_iris
data = load_iris()  # load the iris dataset (features, labels and metadata)
iris_target = data.target  # labels of the samples
iris_features = pd.DataFrame(data=data.data, columns=data.feature_names)  # features as a pandas DataFrame

print(data)

Step 3: A quick look at the data

## Use .info() for an overview of the data
iris_features.info()

## Look at the first and last rows
iris_features.head()

iris_features.tail()

## Class labels: 0, 1 and 2 stand for the three species 'setosa', 'versicolor' and 'virginica'
iris_target

## Use value_counts to count the samples in each class
pd.Series(iris_target).value_counts()

## Summary statistics of the features
iris_features.describe()

Step 4: Visualization

## Combine labels and features
iris_all = iris_features.copy()  ## work on a copy so the original data is not modified
iris_all['target'] = iris_target

## Pairwise scatter plots of the features, colored by label
sns.pairplot(data=iris_all,diag_kind='hist', hue= 'target')
plt.show()

## Box plot of each feature grouped by class
for col in iris_features.columns:
    sns.boxplot(x='target', y=col, saturation=0.5,
                palette='pastel', data=iris_all)
    plt.title(col)
    plt.show()

## 3D scatter plot of the first three features
from mpl_toolkits.mplot3d import Axes3D

fig = plt.figure(figsize=(10,8))
ax = fig.add_subplot(111, projection='3d')

iris_all_class0 = iris_all[iris_all['target']==0].values
iris_all_class1 = iris_all[iris_all['target']==1].values
iris_all_class2 = iris_all[iris_all['target']==2].values
## 'setosa'(0), 'versicolor'(1), 'virginica'(2)
ax.scatter(iris_all_class0[:,0], iris_all_class0[:,1], iris_all_class0[:,2],label='setosa')
ax.scatter(iris_all_class1[:,0], iris_all_class1[:,1], iris_all_class1[:,2],label='versicolor')
ax.scatter(iris_all_class2[:,0], iris_all_class2[:,1], iris_all_class2[:,2],label='virginica')
plt.legend()

plt.show()

Step 5: Train and predict with logistic regression on a binary problem

## To evaluate the model properly, split the data into a training set and a test set, train on the training set, and validate on the test set.
from sklearn.model_selection import train_test_split

## Keep only the samples of classes 0 and 1 (the first 100 rows; class 2 is excluded)
iris_features_part=iris_features.iloc[:100]
iris_target_part=iris_target[:100]

## 80%/20% split: the test set is 20% of the data
x_train,x_test,y_train,y_test=train_test_split(iris_features_part,iris_target_part,test_size=0.2,random_state=2020)

## Import the logistic regression model from sklearn
from sklearn.linear_model import LogisticRegression

## Define the logistic regression model
clf=LogisticRegression(random_state=0,solver='lbfgs')

## Train the model on the training set
clf.fit(x_train,y_train)

## Weights w of the fitted model
print('the weight of Logistic Regression:',clf.coef_)

## Intercept w0 of the fitted model
print('the intercept(w0) of Logistic Regression:',clf.intercept_)

## Predict on the training set and on the test set with the trained model
train_predict=clf.predict(x_train)
test_predict=clf.predict(x_test)

from sklearn import metrics
## Evaluate with accuracy (correctly predicted samples divided by all predicted samples)
print('The training accuracy of the Logistic Regression is:',metrics.accuracy_score(y_train,train_predict))
print('The test accuracy of the Logistic Regression is:',metrics.accuracy_score(y_test,test_predict))

## Confusion matrix (counts of each true/predicted combination); sklearn expects (y_true, y_pred),
## so rows are true labels and columns are predicted labels
confusion_matrix_result=metrics.confusion_matrix(y_test,test_predict)
print('The confusion matrix result:\n',confusion_matrix_result)

## Visualize the result as a heat map
plt.figure(figsize=(8,6))
sns.heatmap(confusion_matrix_result,annot=True,cmap='Blues')
plt.xlabel('Predicted labels')
plt.ylabel('True labels')
plt.show()

Step 6: Train and predict with logistic regression on a three-class (multiclass) problem

## 80%/20% split: the test set is 20% of the data
x_train,x_test,y_train,y_test=train_test_split(iris_features,iris_target,test_size=0.2,random_state=2020)

## Define the logistic regression model
clf=LogisticRegression(random_state=0,solver='lbfgs')

## Train the model on the training set
clf.fit(x_train,y_train)
# LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
#           intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
#           penalty='l2', random_state=0, solver='lbfgs', tol=0.0001,
#           verbose=0, warm_start=False)

## Weights w of the fitted model
print('the weight of Logistic Regression:\n',clf.coef_)
## Intercepts w0 of the fitted model
print('the intercept(w0) of Logistic Regression:\n',clf.intercept_)
## Because this is a 3-class problem, we get the parameters of three logistic regression models; combining the three gives the 3-class classifier
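How three binary models can be combined is sketched below under two explicit assumptions: the one-vs-rest scheme shown in the recorded repr above (`multi_class='ovr'`; other scikit-learn versions may default to a multinomial softmax formulation instead), and made-up parameters `W`, `b` and a made-up sample `x`. Each row of `W` scores one class against the rest, and the three sigmoid outputs are normalized so that they sum to 1.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical parameters: one row of weights and one intercept per class
W = np.array([[ 0.4,  1.4, -2.2, -1.0],
              [ 0.4, -1.5,  0.6, -1.3],
              [-1.6, -1.5,  2.4,  2.5]])
b = np.array([0.3, 0.9, -1.2])

x = np.array([5.1, 3.5, 1.4, 0.2])    # hypothetical sample with four measurements

scores = W @ x + b                    # one linear score per class
per_class = sigmoid(scores)           # each binary model's P(class k versus the rest)
proba = per_class / per_class.sum()   # normalize so the three values sum to 1

print(scores)
print(proba, "predicted class:", proba.argmax())
```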

## Predict on the training set and on the test set with the trained model
train_predict=clf.predict(x_train)
test_predict=clf.predict(x_test)
## Logistic regression is a probabilistic model (p=p(y=1|x,\theta), introduced earlier), so predict_proba can be used to get the class probabilities

train_predict_proba=clf.predict_proba(x_train)
test_predict_proba=clf.predict_proba(x_test)

print('The test predict Probability of each class:\n',test_predict_proba)
## The first column is the predicted probability of class 0, the second of class 1, and the third of class 2.

## Evaluate with accuracy (correctly predicted samples divided by all predicted samples)
print('The training accuracy of the Logistic Regression is:',metrics.accuracy_score(y_train,train_predict))
print('The test accuracy of the Logistic Regression is:',metrics.accuracy_score(y_test,test_predict))

## Confusion matrix; sklearn expects (y_true, y_pred), so rows are true labels and columns are predicted labels
confusion_matrix_result=metrics.confusion_matrix(y_test,test_predict)
print('The confusion matrix result:\n',confusion_matrix_result)

## Visualize the result as a heat map
plt.figure(figsize=(8,6))
sns.heatmap(confusion_matrix_result,annot=True,cmap='Blues')
plt.xlabel('Predicted labels')
plt.ylabel('True labels')
plt.show()

## The confusion matrix result:
## [[10  0  0]
##  [ 0  8  2]
##  [ 0  2  8]]
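The accuracy can be read straight off this matrix. The sketch below copies the recorded matrix by hand rather than recomputing it: correct predictions sit on the diagonal, so accuracy is the diagonal sum divided by the total count.

```python
import numpy as np

# Confusion matrix copied from the recorded output above
cm = np.array([[10, 0, 0],
               [ 0, 8, 2],
               [ 0, 2, 8]])

accuracy = np.trace(cm) / cm.sum()               # correct predictions / all predictions
per_row_correct = np.diag(cm) / cm.sum(axis=1)   # fraction on the diagonal within each row

print(f"accuracy = {accuracy:.4f}")
print("per-row fraction correct:", per_row_correct)
```

Class 0 is separated perfectly, while classes 1 and 2 are confused with each other in 4 of the 30 test samples.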
