I. Application Scenarios of Logistic Regression
Ad click-through prediction (click or not), spam detection, disease diagnosis, financial fraud detection, fake-account detection: all binary (yes/no) classification problems.
II. How Logistic Regression Works
1. Input
The input to logistic regression is the output of a linear regression:
z = w_1·x_1 + w_2·x_2 + ... + w_n·x_n + b
2. Activation Function
1) The sigmoid function
g(z) = 1 / (1 + e^(-z))
The linear regression output z is fed into the sigmoid function.
Output: a probability value in the interval (0, 1), with 0.5 as the default decision threshold.
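As a quick numeric illustration (a minimal sketch, not part of the original material; the values are made up), the snippet below pushes a few linear scores through the sigmoid and applies the default 0.5 threshold:

import numpy as np

def sigmoid(z):
    # Map a linear score z to a probability in (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-2.0, 0.0, 3.0])        # example linear-regression outputs
probs = sigmoid(z)                    # -> [0.119, 0.5, 0.953]
labels = (probs >= 0.5).astype(int)   # default 0.5 threshold -> [0, 1, 1]
print(probs, labels)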
2) Note:
Logistic regression makes its final classification based on the predicted probability of one class: that class is labeled 1 (the positive class) and the other is labeled 0 (the negative class). By default, the target value with fewer samples is taken as the positive class.
3. Loss Function
1) Log-likelihood loss formula
The loss of logistic regression is called the log-likelihood loss. For a single sample with predicted probability h(x) and true label y:
cost(h(x), y) = -log(h(x))        if y = 1
cost(h(x), y) = -log(1 - h(x))    if y = 0
2) The complete loss function over all m samples:
J(w) = -Σ_{i=1..m} [ y_i·log(h(x_i)) + (1 - y_i)·log(1 - h(x_i)) ]
3) Understanding the log-likelihood loss:
As the formulas show, reducing the loss requires increasing the sigmoid output for positive examples and decreasing the sigmoid output for negative examples.
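To make that concrete, here is a small sketch (illustrative values only) showing how the per-sample loss behaves:

import numpy as np

def log_loss_single(p, y):
    # Log-likelihood loss for one sample: p is the sigmoid output, y the true label
    return -np.log(p) if y == 1 else -np.log(1.0 - p)

# Positive example (y = 1): a larger sigmoid output gives a smaller loss
print(log_loss_single(0.9, 1))   # ~0.105
print(log_loss_single(0.3, 1))   # ~1.204
# Negative example (y = 0): a smaller sigmoid output gives a smaller loss
print(log_loss_single(0.1, 0))   # ~0.105
print(log_loss_single(0.7, 0))   # ~1.204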
4. Optimization
As with linear regression, a gradient descent algorithm is used to reduce the value of the loss function. Each update adjusts the weight parameters of the underlying linear model, raising the predicted probability of samples that truly belong to class 1 and lowering that of samples that truly belong to class 0.
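As a rough sketch of what that update looks like (a toy batch-gradient-descent implementation for illustration; sklearn's actual solvers are more sophisticated):

import numpy as np

def train_logistic(X, y, lr=0.1, epochs=1000):
    # Batch gradient descent on the mean log-likelihood loss
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # current predicted probabilities
        grad = (p - y) / len(y)                  # gradient of the mean log loss w.r.t. the score
        w -= lr * (X.T @ grad)                   # update the weights
        b -= lr * grad.sum()                     # update the bias
    return w, b

# Toy data: a single feature that separates the two classes
X = np.array([[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]])
y = np.array([0, 0, 0, 1, 1, 1])
w, b = train_logistic(X, y)
print(w, b)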
III. Logistic Regression API
sklearn.linear_model.LogisticRegression(solver='liblinear', penalty='l2', C=1.0)
solver: optimization method (the default is the open-source liblinear library, which iteratively optimizes the loss with coordinate descent)
    sag: Stochastic Average Gradient descent, a good choice for large datasets
penalty: type of regularization
C: regularization strength
By default the class with fewer samples is treated as the positive class.
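For instance (a hypothetical call, just to show the parameters spelled out):

from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(solver='liblinear', penalty='l2', C=1.0)  # the defaults described above
# A smaller C means stronger regularization:
lr_strong = LogisticRegression(solver='liblinear', penalty='l2', C=0.1)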
IV. Case Study: Cancer Classification
Data source: https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
def logisticregression():
    '''Logistic regression for cancer prediction'''
    # Define the column names for the dataset
    columns = ["Sample code number","Clump Thickness","Uniformity of Cell Size","Uniformity of Cell Shape","Marginal Adhesion","Single Epithelial Cell Size","Bare Nuclei","Bland Chromatin","Normal Nucleoli","Mitoses","Class"]
    data = pd.read_csv("breast-cancer-wisconsin.data", names=columns)
    # Drop rows with missing values ("?" in the raw file)
    data.replace(to_replace="?", value=np.nan, inplace=True)
    data.dropna(axis=0, inplace=True, how="any")
    # Extract the target values
    target = data["Class"]
    # Extract the feature values
    data = data.drop(["Sample code number"], axis=1).iloc[:, :-1]
    # Split into training and test sets
    x_train, x_test, y_train, y_test = train_test_split(data, target, test_size=0.3)
    # Standardize the features
    std = StandardScaler()
    x_train = std.fit_transform(x_train)
    x_test = std.transform(x_test)  # reuse the training-set statistics; do not refit on the test set
    # Train and predict with logistic regression
    lr = LogisticRegression()
    lr.fit(x_train, y_train)
    print("Logistic regression weights:", lr.coef_)
    print("Logistic regression bias:", lr.intercept_)
    # Predictions on the test set
    pre_result = lr.predict(x_test)
    print(pre_result)
    # Prediction accuracy
    score = lr.score(x_test, y_test)
    print(score)
if __name__ == '__main__':
    logisticregression()
V. Evaluation Metrics for Binary Classification (Precision and Recall)
1. Precision:
The proportion of samples predicted positive that are truly positive (how accurate the positive predictions are):
Precision = TP / (TP + FP)
2. Recall:
The proportion of truly positive samples that are predicted positive (how complete the coverage is, i.e. the ability to identify positive samples):
Recall = TP / (TP + FN)
3. F1-score
Reflects the robustness of the model:
F1 = 2·Precision·Recall / (Precision + Recall) = 2·TP / (2·TP + FN + FP)
4. Model evaluation API
sklearn.metrics.classification_report(y_true, y_pred, target_names=None)
y_true: true target values; y_pred: target values predicted by the estimator; target_names: names of the target classes; returns the precision and recall of each class
5. Code
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
def logisticregression():
    '''Logistic regression for cancer prediction'''
    # Define the column names for the dataset
    columns = ["Sample code number","Clump Thickness","Uniformity of Cell Size","Uniformity of Cell Shape","Marginal Adhesion","Single Epithelial Cell Size","Bare Nuclei","Bland Chromatin","Normal Nucleoli","Mitoses","Class"]
    data = pd.read_csv("breast-cancer-wisconsin.data", names=columns)
    # Drop rows with missing values ("?" in the raw file)
    data.replace(to_replace="?", value=np.nan, inplace=True)
    data.dropna(axis=0, inplace=True, how="any")
    # Extract the target values
    target = data["Class"]
    # Extract the feature values
    data = data.drop(["Sample code number"], axis=1).iloc[:, :-1]
    # Split into training and test sets
    x_train, x_test, y_train, y_test = train_test_split(data, target, test_size=0.3)
    # Standardize the features
    std = StandardScaler()
    x_train = std.fit_transform(x_train)
    x_test = std.transform(x_test)  # reuse the training-set statistics
    # Train and predict with logistic regression
    lr = LogisticRegression()
    lr.fit(x_train, y_train)
    # Model parameters learned on the training set
    # print("Logistic regression weights:", lr.coef_)
    # print("Logistic regression bias:", lr.intercept_)
    # Predictions on the test set
    pre_result = lr.predict(x_test)
    # print(pre_result)
    # Prediction accuracy
    score = lr.score(x_test, y_test)
    print(score)
    # Precision and recall
    report = classification_report(y_test, pre_result, target_names=["benign", "malignant"])
    print(report)
if __name__ == '__main__':
    logisticregression()
VI. ROC Curve and AUC
Question: how do we evaluate a classifier when the samples are imbalanced?
1. TPR and FPR
TPR = TP / (TP + FN)
    Of all samples whose true class is 1, the proportion predicted as class 1
FPR = FP / (FP + TN)
    Of all samples whose true class is 0, the proportion predicted as class 1
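A small worked computation (the confusion-matrix counts are made up for illustration):

# Hypothetical confusion-matrix counts
TP, FN, FP, TN = 40, 10, 5, 45
tpr = TP / (TP + FN)   # 40 / 50 = 0.8: 80% of true positives are caught
fpr = FP / (FP + TN)   # 5 / 50 = 0.1: 10% of true negatives are false alarms
print(tpr, fpr)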
2. The ROC curve
The horizontal axis of the ROC curve is the FPR and the vertical axis is the TPR. When the two are equal everywhere, it means that regardless of whether a sample's true class is 1 or 0, the classifier predicts class 1 with the same probability; in that case the AUC is 0.5.
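sklearn can compute the points of this curve directly with sklearn.metrics.roc_curve; a minimal sketch with made-up scores:

import numpy as np
from sklearn.metrics import roc_curve

y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])   # predicted probabilities of class 1
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(fpr)   # x-coordinates of the ROC curve
print(tpr)   # y-coordinates of the ROC curve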
3. The AUC metric
The probabilistic meaning of AUC: if you draw a random positive/negative pair, AUC is the probability that the positive sample scores higher than the negative one. AUC ranges from 0.5 to 1, and higher is better.
AUC = 1: a perfect classifier; whatever threshold you set, the predictions are perfect. In the vast majority of real settings, no perfect classifier exists.
0.5 < AUC < 1: better than random guessing; with a well-chosen threshold the model has predictive value.
AUC = 0.5: the same as random guessing (e.g. a coin flip); the model has no predictive value.
AUC < 0.5: worse than random guessing, but simply inverting every prediction would beat random guessing, so in practice AUC < 0.5 does not arise as a meaningful case.
In short, AUC effectively lies in [0.5, 1], and the closer to 1 the better.
4. AUC computation API
from sklearn.metrics import roc_auc_score
sklearn.metrics.roc_auc_score(y_true, y_score)
Computes the area under the ROC curve, i.e. the AUC value. y_true: the true class of each sample, which must be labeled 0 (negative) or 1 (positive); y_score: the predicted score/probability of each sample
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report,roc_auc_score
def logisticregression():
    '''Logistic regression for cancer prediction'''
    # Define the column names for the dataset
    columns = ["Sample code number","Clump Thickness","Uniformity of Cell Size","Uniformity of Cell Shape","Marginal Adhesion","Single Epithelial Cell Size","Bare Nuclei","Bland Chromatin","Normal Nucleoli","Mitoses","Class"]
    data = pd.read_csv("breast-cancer-wisconsin.data", names=columns)
    # Drop rows with missing values ("?" in the raw file)
    data.replace(to_replace="?", value=np.nan, inplace=True)
    data.dropna(axis=0, inplace=True, how="any")
    # Extract the target values
    target = data["Class"]
    # Extract the feature values
    data = data.drop(["Sample code number"], axis=1).iloc[:, :-1]
    # Split into training and test sets
    x_train, x_test, y_train, y_test = train_test_split(data, target, test_size=0.3)
    # Standardize the features
    std = StandardScaler()
    x_train = std.fit_transform(x_train)
    x_test = std.transform(x_test)  # reuse the training-set statistics
    # Train and predict with logistic regression
    lr = LogisticRegression()
    lr.fit(x_train, y_train)
    # Model parameters learned on the training set
    # print("Logistic regression weights:", lr.coef_)
    # print("Logistic regression bias:", lr.intercept_)
    # Predictions on the test set
    pre_result = lr.predict(x_test)
    # print(pre_result)
    # Prediction accuracy
    score = lr.score(x_test, y_test)
    print(score)
    # Precision and recall
    report = classification_report(y_test, pre_result, target_names=["benign", "malignant"])
    print(report)
    # AUC metric: map the original labels (2 = benign, 4 = malignant) to 0/1
    y_test = np.where(y_test > 2.5, 1, 0)
    print(y_test)
    # Score each sample with the predicted probability of the positive class
    # rather than the hard predicted label, since AUC compares scores
    auc_score = roc_auc_score(y_test, lr.predict_proba(x_test)[:, 1])
    print(auc_score)
if __name__ == '__main__':
    logisticregression()
5. Summary
AUC can only be used to evaluate binary classification.
AUC is well suited to evaluating classifier performance on imbalanced samples.
AUC compares the predicted scores/probabilities, not just the predicted labels.
An AUC above 0.7 generally indicates a fairly good classifier.
VII. Summary of Scikit-learn's Algorithm Implementations
scikit-learn puts the gradient-descent-based solvers in their own estimators, SGDClassifier and SGDRegressor. Their losses are the usual classification and regression losses: for classification, e.g. log loss and hinge loss (SVM); for regression, e.g. mean squared error. The other estimators use the same losses but are not solved by gradient descent, so for large-scale data one generally uses scikit-learn's SGD estimators.
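For example, a minimal sketch (note: recent scikit-learn versions spell the log loss loss='log_loss'; older versions used loss='log'):

from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Logistic regression solved by stochastic gradient descent
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
clf = SGDClassifier(loss="log_loss", max_iter=1000)   # loss="hinge" would give a linear SVM
clf.fit(X, y)
print(clf.score(X, y))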