AUC: Area Under the Curve
AUROC: Area Under the Receiver Operating Characteristic curve
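1. Overview of the ROC curve
The ROC curve is a visualization tool for evaluating classification models. Both of its axes are limited to the range 0 to 1: the x-axis is the false positive rate, FPR (the proportion of actual negatives that are wrongly predicted as positive), and the y-axis is the true positive rate, TPR (the proportion of actual positives that are correctly predicted as positive). In general, the more the curve bulges toward the upper-left corner, the better the model performs; the closer the curve lies to the diagonal, the less accurate the model is, with the diagonal itself corresponding to random guessing.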
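2. AUC
AUC is the area under the ROC curve and summarizes the curve as a single number. Because both axes of the ROC curve run from 0 to 1, the AUC is a part of the 1x1 unit square, so its value lies between 0 and 1.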
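As a minimal sketch of "area under the curve", the block below applies the trapezoidal rule to a handful of ROC points. The points and the helper name `trapezoid_auc` are hypothetical and chosen only for illustration; they do not come from any model in this post.

```r
# Hypothetical ROC points (made up for illustration), ordered by increasing FPR.
fpr <- c(0, 0, 0.5, 0.5, 1)
tpr <- c(0, 0.5, 0.5, 1, 1)

# Trapezoidal rule: for each segment, width (change in FPR) times the
# average height (mean of the two TPR values at the segment's ends).
trapezoid_auc <- function(fpr, tpr) {
  sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)
}

auc_example  <- trapezoid_auc(fpr, tpr)                  # the staircase above
auc_diagonal <- trapezoid_auc(c(0, 1), c(0, 1))          # the diagonal line
auc_perfect  <- trapezoid_auc(c(0, 0, 1), c(0, 1, 1))    # curve through the top-left corner

print(c(example = auc_example, diagonal = auc_diagonal, perfect = auc_perfect))
```

The diagonal gives 0.5 and a curve that passes through the top-left corner gives 1, which is why an AUC near 0.5 suggests a model that is close to guessing.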
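3. Plotting the ROC curve
3.1 Basic concepts
- Predicted probability and threshold:
The output of a classification model includes a probability between 0 and 1 that represents how likely the sample is to belong to a given class. A threshold then splits the samples: those with a probability above the threshold are predicted positive, and those with a probability below it are predicted negative.
- TPR and FPR:
The x-axis of the ROC curve is FPR and the y-axis is TPR. FPR is the proportion of actual negatives that are incorrectly predicted as positive; TPR is the proportion of actual positives that are correctly predicted as positive.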
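To make the two concepts concrete, here is a small sketch that applies one threshold to a set of predicted probabilities and computes TPR and FPR from the resulting counts. The labels, probabilities, and the helper `rates_at` are hypothetical values and names invented for this example.

```r
# Hypothetical data: 1 = positive, 0 = negative, with made-up predicted probabilities.
truth <- c(1, 1, 1, 0, 1, 0, 0, 0)
prob  <- c(0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1)

rates_at <- function(truth, prob, threshold) {
  # Samples with probability above the threshold are predicted positive.
  predicted <- as.integer(prob > threshold)
  tp <- sum(predicted == 1 & truth == 1)  # true positives
  fn <- sum(predicted == 0 & truth == 1)  # false negatives
  fp <- sum(predicted == 1 & truth == 0)  # false positives
  tn <- sum(predicted == 0 & truth == 0)  # true negatives
  c(TPR = tp / (tp + fn), FPR = fp / (fp + tn))
}

rates_05 <- rates_at(truth, prob, threshold = 0.5)
print(rates_05)
```

Each choice of threshold yields one (FPR, TPR) pair, which is one point on the ROC curve.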
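3.2 Steps for plotting the ROC curve
- Sort all samples by predicted probability in descending order.
- Sweep the threshold from 1 down to 0 and compute the (FPR, TPR) pair at each threshold.
- Plot the pairs in a Cartesian coordinate system.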
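The steps above can be sketched in base R as follows. The data are hypothetical, and the sketch assumes there are no tied probabilities, so lowering the threshold past each sample in turn adds exactly one predicted positive.

```r
# Hypothetical data: 1 = positive, 0 = negative, with made-up predicted probabilities.
truth <- c(1, 1, 1, 0, 1, 0, 0, 0)
prob  <- c(0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1)

# Step 1: sort the samples by descending probability.
ord <- order(prob, decreasing = TRUE)
truth_sorted <- truth[ord]

# Step 2: as the threshold drops past each sample, that sample becomes a
# predicted positive, so cumulative counts give TP and FP at each threshold.
# The leading 0 is the point where the threshold is above every probability.
tpr <- c(0, cumsum(truth_sorted == 1) / sum(truth_sorted == 1))
fpr <- c(0, cumsum(truth_sorted == 0) / sum(truth_sorted == 0))

# Step 3: plot the (FPR, TPR) pairs, with the diagonal for reference.
plot(fpr, tpr, type = "l", xlim = c(0, 1), ylim = c(0, 1),
     xlab = "False Positive Rate", ylab = "True Positive Rate")
abline(a = 0, b = 1, lty = 2)
```

With tied probabilities, the tied samples would need to be moved across the threshold together, which this sketch does not handle.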
4. ROC and AUC in R
# install.packages("pROC")
# install.packages("randomForest")
library(pROC)
library(randomForest) #Random Forest is a way to classify samples and we can change the threshold that we use to make those decisions.
set.seed(420) # this will make my results match yours
num.samples <- 100
weight <- sort(rnorm(n=num.samples, mean=172, sd=29))
obese <- ifelse(test=(runif(n=num.samples) < (rank(weight)/num.samples)),
yes=1, no=0)
obese
plot(x=weight, y=obese)
## fit a logistic regression to the data...
glm.fit <- glm(obese ~ weight, family=binomial)
lines(weight, glm.fit$fitted.values)
#######################################
##
## draw ROC and AUC using pROC
##
#######################################
## NOTE: By default, the graphs come out looking terrible
## The problem is that ROC graphs should be square, since the x and y axes
## both go from 0 to 1. However, the window in which I draw them isn't square
## so extra whitespace is added to pad the sides.
roc(obese, glm.fit$fitted.values, plot=TRUE)
## Now let's configure R so that it prints the graph as a square.
##
par(pty = "s") ## pty sets the aspect ratio of the plot region. Two options:
## "s" - creates a square plotting region
## "m" - (the default) creates a maximal plotting region
roc(obese, glm.fit$fitted.values, plot=TRUE)
## NOTE: By default, roc() uses specificity on the x-axis and the values range
## from 1 to 0. This makes the graph look like what we would expect, but the
## x-axis itself might induce a headache. To use 1-specificity (i.e. the
## False Positive Rate) on the x-axis, set "legacy.axes" to TRUE.
roc(obese, glm.fit$fitted.values, plot=TRUE, legacy.axes=TRUE)
## If you want to rename the x and y axes...
roc(obese, glm.fit$fitted.values, plot=TRUE, legacy.axes=TRUE, percent=TRUE, xlab="False Positive Percentage", ylab="True Positive Percentage")
## We can also change the color of the ROC line, and make it wider...
roc(obese, glm.fit$fitted.values, plot=TRUE, legacy.axes=TRUE, percent=TRUE, xlab="False Positive Percentage", ylab="True Positive Percentage", col="#377eb8", lwd=4)
## If we want to find out the optimal threshold we can store the
## data used to make the ROC graph in a variable...
roc.info <- roc(obese, glm.fit$fitted.values, legacy.axes=TRUE)
str(roc.info)
## and then extract just the information that we want from that variable.
roc.df <- data.frame(
tpp=roc.info$sensitivities*100, ## tpp = true positive percentage
fpp=(1 - roc.info$specificities)*100, ## fpp = false positive percentage
thresholds=roc.info$thresholds)
head(roc.df) ## head() will show us the values for the upper right-hand corner
## of the ROC graph, when the threshold is so low
## (negative infinity) that every single sample is called "obese".
## Thus TPP = 100% and FPP = 100%
tail(roc.df) ## tail() will show us the values for the lower left-hand corner
## of the ROC graph, when the threshold is so high (infinity)
## that every single sample is called "not obese".
## Thus, TPP = 0% and FPP = 0%
## now let's look at the thresholds between TPP 60% and 80%...
roc.df[roc.df$tpp > 60 & roc.df$tpp < 80,]
## We can calculate the area under the curve...
roc(obese, glm.fit$fitted.values, plot=TRUE, legacy.axes=TRUE, percent=TRUE, xlab="False Positive Percentage", ylab="True Positive Percentage", col="#377eb8", lwd=4, print.auc=TRUE)
## ...and the partial area under the curve.
roc(obese, glm.fit$fitted.values, plot=TRUE, legacy.axes=TRUE, percent=TRUE, xlab="False Positive Percentage", ylab="True Positive Percentage", col="#377eb8", lwd=4, print.auc=TRUE, print.auc.x=45, partial.auc=c(100, 90), auc.polygon = TRUE, auc.polygon.col = "#377eb822")
#######################################
##
## Now let's fit the data with a random forest...
##
#######################################
rf.model <- randomForest(factor(obese) ~ weight)
## ROC for random forest
roc(obese, rf.model$votes[,1], plot=TRUE, legacy.axes=TRUE, percent=TRUE, xlab="False Positive Percentage", ylab="True Positive Percentage", col="#4daf4a", lwd=4, print.auc=TRUE)
#######################################
##
## Now layer logistic regression and random forest ROC graphs..
##
#######################################
roc(obese, glm.fit$fitted.values, plot=TRUE, legacy.axes=TRUE, percent=TRUE, xlab="False Positive Percentage", ylab="True Positive Percentage", col="#377eb8", lwd=4, print.auc=TRUE)
plot.roc(obese, rf.model$votes[,1], percent=TRUE, col="#4daf4a", lwd=4, print.auc=TRUE, add=TRUE, print.auc.y=40)
legend("bottomright", legend=c("Logistic Regression", "Random Forest"), col=c("#377eb8", "#4daf4a"), lwd=4)
#######################################
##
## Now that we're done with our ROC fun, let's reset the par() variables.
## There are two ways to do it...
##
#######################################
par(pty = "m")
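The comment above mentions two ways to reset the graphical parameters, but the script shows only one. A common alternative, which may or may not be the one the comment has in mind, is to save the values that par() returns when a setting is changed and hand them back afterwards. The variable name `old.par` is an assumption made for this sketch.

```r
# par() invisibly returns the previous values of the settings it changes,
# so they can be stored at the moment the square plotting region is requested.
old.par <- par(pty = "s")

## ... draw the ROC graphs here ...

# Restore whatever the settings were before, without hard-coding "m".
par(old.par)
```

This restores the earlier setting even when it was not the default.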
References:
https://www.bilibili.com/video/BV1SK4y1K7v3
https://www.youtube.com/watch?v=qcvAqAH60Yw