AUC and ROC

AUC: Area Under the Curve

AUROC: Area Under the Receiver Operating Characteristic curve

1. ROC curve overview

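The ROC curve is a visual tool for evaluating classification models. It is drawn with both axes limited to the range 0 to 1: the x-axis is the false positive rate (FPR, the fraction of actual negatives that are wrongly predicted as positive), and the y-axis is the true positive rate (TPR, the fraction of actual positives that are correctly predicted as positive). In general, the more the curve bulges toward the upper-left corner, the better the model performs; the closer the curve lies to the diagonal, the less discriminating the model is, with the diagonal itself corresponding to random guessing.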

2. AUC

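AUC is the area under the ROC curve and summarizes the curve as a single number. Because both axes of the ROC curve run from 0 to 1, the AUC is a portion of the 1x1 unit square, so its value lies between 0 and 1.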
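As a minimal sketch of this idea, the area can be approximated with the trapezoid rule over a set of (FPR, TPR) points. The points below are made up for illustration, and `trapezoid_auc` is a hypothetical helper name, not a function from pROC.

```r
# Hypothetical helper: area under a curve given as (fpr, tpr) points
# sorted by increasing fpr, computed with the trapezoid rule.
trapezoid_auc <- function(fpr, tpr) {
  n <- length(fpr)
  widths  <- diff(fpr)                 # horizontal step between points
  heights <- (tpr[-n] + tpr[-1]) / 2   # average height over each step
  sum(widths * heights)
}

# Made-up ROC points running from (0, 0) to (1, 1)
fpr <- c(0, 0, 0.5, 0.5, 1)
tpr <- c(0, 0.5, 0.5, 1, 1)

auc_value <- trapezoid_auc(fpr, tpr)
print(auc_value)  # 0.75
```

For the diagonal alone, `trapezoid_auc(c(0, 1), c(0, 1))` gives 0.5, which is the value expected from random guessing.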

3. Plotting the ROC curve

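3.1 Basic concepts
  • Predicted probability and threshold:
    The output of a classification model includes a probability between 0 and 1 that represents how likely the sample is to belong to a given class. A threshold then splits the samples: those with a probability above the threshold are called positive, and those below it are called negative.
  • TPR and FPR: The x-axis of the ROC curve is FPR and the y-axis is TPR. FPR is the probability that a negative sample is wrongly predicted as positive; TPR is the probability that a positive sample is correctly predicted as positive.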
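To make the two rates concrete, here is a small sketch that applies one threshold to a handful of predictions and computes TPR and FPR. The labels, probabilities, and the threshold of 0.5 are invented for illustration only.

```r
# Made-up data: 1 = positive, 0 = negative, plus predicted probabilities
labels <- c(1, 1, 1, 0, 0, 0, 1, 0)
probs  <- c(0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.6, 0.1)

threshold <- 0.5              # assumed cut-off, chosen only for this example
predicted_positive <- probs > threshold

# TPR: share of actual positives that are predicted positive
tpr <- sum(predicted_positive & labels == 1) / sum(labels == 1)
# FPR: share of actual negatives that are predicted positive
fpr <- sum(predicted_positive & labels == 0) / sum(labels == 0)

print(c(TPR = tpr, FPR = fpr))  # TPR 0.75, FPR 0.25
```

Raising the threshold makes fewer samples positive, which lowers both rates; lowering it raises both. This trade-off is what the ROC curve traces out.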
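3.2 Steps for plotting the ROC curve
  1. Sort all samples by predicted probability in decreasing order.
  2. Move the threshold from 1 down to 0 and compute the (FPR, TPR) pair at each threshold.
  3. Plot the pairs in a Cartesian coordinate system.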
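The steps above can be sketched in base R as follows. This is an illustrative version, not how pROC is implemented: `roc_points` is a hypothetical helper, the data are made up, and a sample is called positive when its probability is at or above the threshold so that each distinct score yields one point.

```r
# Hypothetical helper: one (FPR, TPR) pair per candidate threshold.
roc_points <- function(labels, probs) {
  # Steps 1 and 2: thresholds in decreasing order, starting above every score
  thresholds <- c(Inf, sort(unique(probs), decreasing = TRUE))
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  tpr <- vapply(thresholds,
                function(t) sum(probs >= t & labels == 1) / n_pos,
                numeric(1))
  fpr <- vapply(thresholds,
                function(t) sum(probs >= t & labels == 0) / n_neg,
                numeric(1))
  data.frame(threshold = thresholds, fpr = fpr, tpr = tpr)
}

# Made-up labels (1 = positive) and predicted probabilities
labels <- c(1, 1, 0, 1, 0, 0)
probs  <- c(0.9, 0.8, 0.7, 0.6, 0.3, 0.2)

pts <- roc_points(labels, probs)
print(pts)
```

For step 3, the pairs can be drawn with something like `plot(pts$fpr, pts$tpr, type = "l")`. The curve starts at (0, 0), where nothing is called positive, and ends at (1, 1), where everything is.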

4. ROC and AUC in R

# install.packages("pROC")
# install.packages("randomForest")
library(pROC) 
library(randomForest) #Random Forest is a way to classify samples and we can change the threshold that we use to make those decisions.
set.seed(420) # this will make my results match yours
num.samples <- 100
weight <- sort(rnorm(n=num.samples, mean=172, sd=29))
obese <- ifelse(test=(runif(n=num.samples) < (rank(weight)/num.samples)), 
                yes=1, no=0)
obese
plot(x=weight, y=obese)
## fit a logistic regression to the data...
glm.fit <- glm(obese ~ weight, family=binomial)
lines(weight, glm.fit$fitted.values)

draw ROC and AUC using pROC

#######################################
##
## draw ROC and AUC using pROC
##
#######################################
## NOTE: By default, the graphs come out looking terrible
## The problem is that ROC graphs should be square, since the x and y axes
## both go from 0 to 1. However, the window in which I draw them isn't square
## so extra whitespace is added to pad the sides.
roc(obese, glm.fit$fitted.values, plot=TRUE)
## Now let's configure R so that it prints the graph as a square.
##
par(pty = "s") ## pty sets the aspect ratio of the plot region. Two options:
##                "s" - creates a square plotting region
##                "m" - (the default) creates a maximal plotting region
roc(obese, glm.fit$fitted.values, plot=TRUE)
## NOTE: By default, roc() uses specificity on the x-axis and the values range
## from 1 to 0. This makes the graph look like what we would expect, but the
## x-axis itself might induce a headache. To use 1-specificity (i.e. the 
## False Positive Rate) on the x-axis, set "legacy.axes" to TRUE.
roc(obese, glm.fit$fitted.values, plot=TRUE, legacy.axes=TRUE)
## If you want to rename the x and y axes...
roc(obese, glm.fit$fitted.values, plot=TRUE, legacy.axes=TRUE, percent=TRUE, xlab="False Positive Percentage", ylab="True Positive Percentage")
## We can also change the color of the ROC line, and make it wider...
roc(obese, glm.fit$fitted.values, plot=TRUE, legacy.axes=TRUE, percent=TRUE, xlab="False Positive Percentage", ylab="True Positive Percentage", col="#377eb8", lwd=4)
## If we want to find out the optimal threshold we can store the 
## data used to make the ROC graph in a variable...
roc.info <- roc(obese, glm.fit$fitted.values, legacy.axes=TRUE)
str(roc.info)
## and then extract just the information that we want from that variable.
roc.df <- data.frame(
  tpp=roc.info$sensitivities*100, ## tpp = true positive percentage
  fpp=(1 - roc.info$specificities)*100, ## fpp = false positive percentage
  thresholds=roc.info$thresholds)
head(roc.df) ## head() will show us the values for the upper right-hand corner
## of the ROC graph, when the threshold is so low 
## (negative infinity) that every single sample is called "obese".
## Thus TPP = 100% and FPP = 100%
tail(roc.df) ## tail() will show us the values for the lower left-hand corner
## of the ROC graph, when the threshold is so high (infinity) 
## that every single sample is called "not obese". 
## Thus, TPP = 0% and FPP = 0%
## now let's look at the thresholds between TPP 60% and 80%...
roc.df[roc.df$tpp > 60 & roc.df$tpp < 80,]
## We can calculate the area under the curve...
roc(obese, glm.fit$fitted.values, plot=TRUE, legacy.axes=TRUE, percent=TRUE, xlab="False Positive Percentage", ylab="True Positive Percentage", col="#377eb8", lwd=4, print.auc=TRUE)
## ...and the partial area under the curve.
roc(obese, glm.fit$fitted.values, plot=TRUE, legacy.axes=TRUE, percent=TRUE, xlab="False Positive Percentage", ylab="True Positive Percentage", col="#377eb8", lwd=4, print.auc=TRUE, print.auc.x=45, partial.auc=c(100, 90), auc.polygon = TRUE, auc.polygon.col = "#377eb822")
#######################################
##
## Now let's fit the data with a random forest...
##
#######################################
rf.model <- randomForest(factor(obese) ~ weight)
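## ROC for random forest
## NOTE: votes[,1] is the fraction of trees voting for class "0";
## roc() detects the direction of the comparison automatically.
roc(obese, rf.model$votes[,1], plot=TRUE, legacy.axes=TRUE, percent=TRUE, xlab="False Positive Percentage", ylab="True Positive Percentage", col="#4daf4a", lwd=4, print.auc=TRUE)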
#######################################
##
## Now layer logistic regression and random forest ROC graphs..
##
#######################################
roc(obese, glm.fit$fitted.values, plot=TRUE, legacy.axes=TRUE, percent=TRUE, xlab="False Positive Percentage", ylab="True Positive Percentage", col="#377eb8", lwd=4, print.auc=TRUE)
plot.roc(obese, rf.model$votes[,1], percent=TRUE, col="#4daf4a", lwd=4, print.auc=TRUE, add=TRUE, print.auc.y=40)
legend("bottomright", legend=c("Logistic Regression", "Random Forest"), col=c("#377eb8", "#4daf4a"), lwd=4)
#######################################
##
## Now that we're done with our ROC fun, let's reset the par() variables.
## There are two ways to do it...
##
#######################################
par(pty = "m")
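The script above looks for a good threshold by eye, filtering `roc.df` for rows with a TPP between 60% and 80%. One common alternative, shown here only as a sketch, is Youden's J statistic (TPR minus FPR), which picks the threshold furthest above the diagonal. The data frame below is a hypothetical stand-in with the same columns as `roc.df`; its values are invented and are not output from the model above.

```r
# Hypothetical stand-in for roc.df (invented values, same column names)
roc.df.example <- data.frame(
  tpp = c(100, 90, 80, 60, 30, 0),   # true positive percentage
  fpp = c(100, 60, 30, 20, 5, 0),    # false positive percentage
  thresholds = c(-Inf, 0.2, 0.4, 0.5, 0.7, Inf))

# Youden's J on the 0-1 scale: TPR - FPR
youden <- (roc.df.example$tpp - roc.df.example$fpp) / 100

# Row with the largest J
best <- roc.df.example[which.max(youden), ]
print(best)  # threshold 0.4, with TPP = 80 and FPP = 30
```

Whether Youden's J is the right criterion depends on the relative cost of false positives and false negatives, so treat it as one option rather than the definition of "optimal". pROC also offers `coords()` with `x="best"` for this kind of lookup; check its documentation for the exact arguments.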

References:
https://www.bilibili.com/video/BV1SK4y1K7v3
https://www.youtube.com/watch?v=qcvAqAH60Yw
