本文簡(jiǎn)單介紹幾種重抽樣方法 (in R)。
我們生成一組數(shù)據(jù)比原,其中x是我們的觀測(cè)值,y是我們對(duì)其的標(biāo)簽产弹。
# generate random data
set.seed(1111)
x <- c(rnorm(10), rnorm(10, mean=5, sd=5))
y <- c(rep("A", 10), rep("B", 10))
df1 <- data.frame(x,y)
str(df1)
## 'data.frame': 20 obs. of 2 variables:
## $ x: num -0.0866 1.3225 0.6397 1.1748 0.1163 ...
## $ y: chr "A" "A" "A" "A" ...
Permutation
Permutation相當(dāng)于是一種無放回的重抽樣方法,通常用于假設(shè)檢驗(yàn)弯囊。
# single permutation
set.seed(2222)
sample(df1$x, replace = FALSE)
## [1] 9.31952342 -0.08658011 5.56155482 3.71174074 1.17478657 0.63970204
## [7] -2.93084636 0.18759838 1.38404752 1.28394086 0.11629031 6.77816542
## [13] 1.11777194 0.11760093 -4.08500801 1.32252443 4.59743697 -2.67140783
## [19] 0.67750806 9.95418174
我們可以使用Permutation test檢驗(yàn)A痰哨,B兩組的值是否有差異
# permutation test (n=1000)
set.seed(3333)
permt.ls <- list()
for (i in 1:1000) {
permt.i <- sample(df1$x, replace = FALSE)
# calculate the mean of difference from permutated samples
diff.i <- abs(mean(permt.i[1:10]) - mean(permt.i[11:20]))
permt.ls[[i]] <- c(permt.i, diff.i)
}
permt.df <- Reduce(rbind, permt.ls)
diff.raw <- abs(mean(df1$x[1:10]) - mean(df1$x[11:20]))
# Calculate the p-value
mean(permt.df[,21] >= diff.raw)
## [1] 0.079
每一次重抽樣后,我們都可以計(jì)算兩組樣本均值的差異匾嘱。如果重抽樣樣本組間差異大于原始樣本組間差異的話斤斧,可以認(rèn)為是一次錯(cuò)誤事件,通過計(jì)算錯(cuò)誤事件在總重抽樣次數(shù)中的占比就可以得到置換檢驗(yàn)的p值霎烙。
在這里1000次重抽樣中撬讽,只有79次是錯(cuò)誤事件,所以我們的p值為0.079
Bootstrap
Bootstrap是一種有放回的重抽樣方法悬垃,通常用于參數(shù)估計(jì)游昼。
# single bootstrap
set.seed(4444)
sample(df1$x, replace = TRUE)
## [1] 1.1177719 9.9541817 -2.9308464 0.1875984 -2.9308464 -4.0850080
## [7] -4.0850080 4.5974370 0.1162903 0.6397020 6.7781654 4.5974370
## [13] 1.3225244 9.3195234 1.2839409 -4.0850080 0.6397020 1.1177719
## [19] 0.6775081 3.7117407
例如,我們用bootstrap估計(jì)總體的均值
set.seed(5555)
mean.raw <- mean(df1$x)
mean.i <- c()
# bootstrap (n=1000)
for (i in 1:1000) {
boot.i <- sample(df1$x, replace = TRUE)
mean.i[i] <- mean(boot.i)
}
mean.boost <- mean(mean.i)
mean.raw; mean.boost
## [1] 1.908527
## [1] 1.956184
此外尝蠕,我們還可以計(jì)算bootstrap預(yù)測(cè)均值的標(biāo)準(zhǔn)誤(SE)
# calculate the standard error
mean.se.boost <- sqrt(sum((mean.i - mean.boost)^2)/(1000-1))
mean.se.boost
## [1] 0.788522
Jackknife
Jakknife可以被認(rèn)為是一種leave-one-out的重抽樣方法烘豌,對(duì)于大小為k的數(shù)據(jù)集,將產(chǎn)生k個(gè)大小為k-1的樣本.
jack.ls <- list()
for (i in 1:nrow(df1)) {
jack.ls[[i]] <- df1[-i,]
}
length(jack.ls); dim(jack.ls[[1]])
## [1] 20
## [1] 19 2
Cross validation
交叉驗(yàn)證將數(shù)據(jù)切分為測(cè)試集和驗(yàn)證集趟佃,常在模型擬合中使用扇谣。例如k-fold cross validation將數(shù)據(jù)劃分為k組不重疊的數(shù)據(jù)集昧捷。
Figure: Illustration of 5-fold CV (https://yey.world/2020/08/31/MAST90083-05/).
# 1-fold cv
set.seed(6666)
training_size <- round(nrow(df1)*0.7)
training_idx <- sample(nrow(df1), size = training_size, replace = FALSE)
training_set <- df1[training_idx,]
val_set <- df1[-training_idx,]
nrow(training_set); nrow(val_set)
## [1] 14
## [1] 6
有時(shí)候闲昭,由于樣本量太小,無法滿足k-fold CV中不重疊分組的要求靡挥,有的樣本不可避免地被重復(fù)使用序矩。
# 5-fold cv
set.seed(7777)
fold_idx <- list()
val_size <- nrow(df1) - training_size
all_idx <- 1:nrow(df1)
for (i in 1:5) {
fold_idx_i <- unique(unlist(fold_idx))
if (all(all_idx %in% fold_idx_i) | is.null(fold_idx_i)) {
fold_idx[[i]] <- sample(all_idx, val_size, replace = FALSE)
} else if (val_size < length(all_idx[-fold_idx_i])) {
fold_idx[[i]] <- sample(all_idx[-fold_idx_i], val_size, replace = FALSE)
} else if (val_size > length(all_idx[-fold_idx_i])) {
fold_idx[[i]] <- c(all_idx[-fold_idx_i], sample(all_idx, val_size-length(all_idx[-fold_idx_i]), replace = FALSE))
}
}
fold_data <- lapply(fold_idx, function(i) {
list(training = df1[-i,], valdation = df1[i,])
})
fold_idx
## [[1]]
## [1] 16 3 19 5 15 14
##
## [[2]]
## [1] 13 1 18 4 9 17
##
## [[3]]
## [1] 7 20 6 8 2 12
##
## [[4]]
## [1] 10 11 14 13 7 15
##
## [[5]]
## [1] 3 10 2 5 15 12
以上就是對(duì)重抽樣方法的簡(jiǎn)單介紹。
Ref: