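R provides a family of apply() functions that offer a concise, convenient implementation of the Split-Apply-Combine strategy in data analysis. These functions all follow the same basic workflow: first split the data into smaller pieces according to some rule, then apply an operation to each piece, and finally combine the results. For details on the strategy, see Hadley Wickham's paper The Split-Apply-Combine Strategy for Data Analysis.
The apply() family mainly consists of the following 7 functions (sapply(), vapply() and replicate(), also covered in this post, are documented on the same help page as lapply()):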
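The three steps can also be written out by hand. The following is a minimal sketch on the built-in iris data set; the variable names pieces, means and result are only illustrative.

```r
# Split-Apply-Combine spelled out step by step
pieces <- split(iris$Sepal.Length, iris$Species)  # split: one numeric vector per species
means  <- lapply(pieces, mean)                    # apply: mean of each piece
result <- unlist(means)                           # combine: one named numeric vector
result
```

The functions described in the rest of this post bundle some or all of these steps into a single call.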
base::apply Apply Functions Over Array Margins
base::by Apply a Function to a Data Frame Split by Factors
base::eapply Apply a Function Over Values in an Environment
base::lapply Apply a Function over a List or Vector
base::mapply Apply a Function to Multiple List or Vector Arguments
base::rapply Recursively Apply a Function to a List
base::tapply Apply a Function Over a Ragged Array
1. The apply() function
# create a matrix of 10 rows x 2 columns
m <- matrix(c(1:10, 11:20), nrow = 10, ncol = 2)
# mean of the rows
apply(m, 1, mean)
[1] 6 7 8 9 10 11 12 13 14 15
# mean of the columns
apply(m, 2, mean)
[1] 5.5 15.5
# divide all values by 2
apply(m, 1:2, function(x) x/2)
[,1] [,2]
[1,] 0.5 5.5
[2,] 1.0 6.0
[3,] 1.5 6.5
[4,] 2.0 7.0
[5,] 2.5 7.5
[6,] 3.0 8.0
[7,] 3.5 8.5
[8,] 4.0 9.0
[9,] 4.5 9.5
[10,] 5.0 10.0
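As a small supplement, the sketch below checks the row and column means against the dedicated base R functions rowMeans() and colMeans(), and shows what happens when FUN returns more than one value per row. The names row_means, col_means and r are illustrative.

```r
m <- matrix(c(1:10, 11:20), nrow = 10, ncol = 2)

row_means <- apply(m, 1, mean)   # MARGIN = 1: one call per row
col_means <- apply(m, 2, mean)   # MARGIN = 2: one call per column

# rowMeans() and colMeans() compute the same values directly
all.equal(row_means, rowMeans(m))
all.equal(col_means, colMeans(m))

# If FUN returns a vector of length 2, each row's result becomes a column,
# so the result here has 2 rows and 10 columns
r <- apply(m, 1, range)
dim(r)
```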
2. The by() function
by() splits a data frame by one or more factors and then applies a function to each subset (Apply a Function to a Data Frame Split by Factors).
attach(iris)
head(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
# get the mean of the first 4 variables, by species
by(iris[, 1:4], Species, colMeans)
Species: setosa
Sepal.Length Sepal.Width Petal.Length Petal.Width
5.006 3.428 1.462 0.246
------------------------------------------------------------
Species: versicolor
Sepal.Length Sepal.Width Petal.Length Petal.Width
5.936 2.770 4.260 1.326
------------------------------------------------------------
Species: virginica
Sepal.Length Sepal.Width Petal.Length Petal.Width
6.588 2.974 5.552 2.026
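The same grouping can be done without attach() by referring to the factor explicitly. As a hedged alternative, aggregate() with a formula computes the same per-species means and returns an ordinary data frame. The names res and agg are illustrative.

```r
# by() without attach(): pass the grouping factor as iris$Species
res <- by(iris[, 1:4], iris$Species, colMeans)

# aggregate() gives the per-species means as a data frame (one row per species)
agg <- aggregate(. ~ Species, data = iris, FUN = mean)
agg
```

A data frame is often easier to process further than the printed "by" object.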
3. The eapply() function
eapply() is rarely used. It applies a function to the variables in an environment and returns the results as a list.
# a new environment
e <- new.env()
# two environment variables, a and b
e$a <- 1:10
e$b <- 11:20
# mean of the variables
eapply(e, mean)
$b
[1] 15.5
$a
[1] 5.5
Normally we do not create our own environments, but many R packages do use new environments.
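Two details are worth a quick sketch: the order of the eapply() result is not guaranteed, and variables whose names start with a dot are skipped unless all.names = TRUE is set. The variable .hidden below is a made-up example.

```r
e <- new.env()
e$a <- 1:10
e$b <- 11:20
e$.hidden <- 100   # hypothetical variable; dot-names are skipped by default

res <- eapply(e, mean)
res <- res[order(names(res))]   # sort by name for a stable order
names(res)

# all.names = TRUE also includes variables whose names start with a dot
res_all <- eapply(e, mean, all.names = TRUE)
length(res_all)
```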
4. The lapply() function
The "l" in lapply() stands for list. It accepts a list (or vector) as input, calls FUN on each element in turn, and returns a list. It can also be used on a data frame, because a data frame is a special kind of list.
# create a list with 2 elements
l <- list(a = 1:10, b = 11:20)
# the mean of the values in each element
lapply(l, mean)
$a
[1] 5.5
$b
[1] 15.5
# the sum of the values in each element
lapply(l, sum)
$a
[1] 55
$b
[1] 155
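Because a data frame is a list of columns, lapply() visits it column by column. A minimal sketch on the built-in iris data; col_classes and num_means are illustrative names.

```r
# class of every column, returned as a list with one entry per column
col_classes <- lapply(iris, class)
col_classes$Species

# mean of the four numeric columns
num_means <- lapply(iris[, 1:4], mean)
num_means$Sepal.Length
```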
5. The sapply() function
The "s" in sapply() stands for simplify. It differs from lapply() in that it tries to simplify the result. Repeating the previous example with sapply() instead of lapply():
# create a list with 2 elements
l <- list(a = 1:10, b = 11:20)
# mean of values using sapply
sapply(l, mean)
#a b
#5.5 15.5
l.mean <- sapply(l, mean)
# what type of object was returned?
class(l.mean)
#[1] "numeric"
# it's a numeric vector, so we can get element "a" like this
l.mean[['a']]
#[1] 5.5
sapply() automatically converted the result into a numeric vector. Specifically, if the result of the apply step is a list whose elements all have length 1, sapply() converts it to a vector; if all elements have the same length greater than 1, sapply() converts it to a matrix; if sapply() cannot determine a simplification rule, it leaves the result unsimplified and returns a list, which is then the same as what lapply() returns.
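The three simplification cases can be seen side by side in a short sketch. The names v, mat and lst are illustrative.

```r
l <- list(a = 1:10, b = 11:20)

# every result has length 1 -> named numeric vector
v <- sapply(l, mean)

# every result has length 2 -> matrix with one column per list element
mat <- sapply(l, range)
dim(mat)

# results have different lengths -> no simplification, a list is returned
lst <- sapply(list(a = 1:3, b = 1:5), function(x) x[x > 2])
class(lst)
```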
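6. The vapply() function
vapply() is similar to sapply(), but it lets you specify the type of the return value in advance, which makes the result safer.
The basic form is vapply(X, FUN, FUN.VALUE), where FUN.VALUE is a template for the value that each call to FUN must return (its type, length and, optionally, names).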
l <- list(a = 1:10, b = 11:20)
# fivenum of values using vapply
l.fivenum <- vapply(l, fivenum, c(Min.=0, "1st Qu."=0, Median=0, "3rd Qu."=0, Max.=0))
class(l.fivenum)
[1] "matrix"
# (R >= 4.0.0 prints "matrix" "array" here)
# let's see it
l.fivenum
a b
Min. 1.0 11.0
1st Qu. 3.0 13.0
Median 5.5 15.5
3rd Qu. 8.0 18.0
Max. 10.0 20.0
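What "safer" means can be illustrated with a small sketch: vapply() raises an error when FUN returns something that does not match FUN.VALUE, and it returns a correctly typed result even for empty input, where sapply() falls back to an empty list. The names means, res, empty_s and empty_v are illustrative.

```r
l <- list(a = 1:10, b = 11:20)

# each call must return exactly one numeric value
means <- vapply(l, mean, FUN.VALUE = numeric(1))

# range() returns two values, so this violates the template and signals an error
res <- tryCatch(vapply(l, range, FUN.VALUE = numeric(1)),
                error = function(e) "error")
res

# empty input: sapply() returns list(), vapply() returns numeric(0)
empty_s <- sapply(list(), mean)
empty_v <- vapply(list(), mean, FUN.VALUE = numeric(1))
```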
7. The replicate() function
replicate() evaluates an expression N times. It is often used for repeated random number generation, for example in simulations.
replicate(10, rnorm(10))
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 0.67947001 -1.94649409 0.28144696 0.5872913 2.22715085 -0.275918282
[2,] 1.17298643 -0.01529898 -1.47314092 -1.3274354 -0.04105249 0.528666264
[3,] 0.77272662 -2.36122644 0.06397576 1.5870779 -0.33926083 1.121164338
[4,] -0.42702542 -0.90613885 0.83645668 -0.5462608 -0.87458396 -0.723858258
[5,] -0.73892937 -0.57486661 -0.04418200 -0.1120936 0.08253614 1.319095242
[6,] 2.93827883 -0.33363446 0.55405024 -0.4942736 0.66407615 -0.153623614
[7,] 1.30037496 -0.26207115 0.49818215 1.0774543 -0.28206908 0.825488436
[8,] -0.04153545 -0.23621632 -1.01192741 0.4364413 -2.28991601 -0.002867193
[9,] 0.01262547 0.40247248 0.65816829 0.9541927 -1.63770154 0.328180660
[10,] 0.96525278 -0.37850821 -0.85869035 -0.6055622 1.13756753 -0.371977151
[,7] [,8] [,9] [,10]
[1,] 0.03928297 0.34990909 -0.3159794 1.08871657
[2,] -0.79258805 -0.30329668 -1.0902070 0.73356542
[3,] 0.10673459 -0.02849216 0.8094840 0.06446245
[4,] -0.84584079 -0.57308461 -1.3570979 -0.89801330
[5,] -1.50226560 -2.35751419 1.2104163 0.74650696
[6,] -0.32790991 0.80144695 -0.0071844 0.05742356
[7,] 1.36719970 2.34148354 0.9148911 0.20451421
[8,] -0.51112579 -0.53658159 1.5194130 -0.94250069
[9,] 0.52017814 -1.22252527 0.4519702 0.08779704
[10,] 1.35908918 1.09024342 0.5912627 -0.20709053
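Since the output above changes on every run, a reproducible variant may be useful. The sketch below fixes the random seed with set.seed() (the seed value 42 is arbitrary) and uses replicate() for a small simulation of sample means.

```r
set.seed(42)
# 1000 repetitions; each one draws 10 standard normal values and takes their mean
sample_means <- replicate(1000, mean(rnorm(10)))
length(sample_means)

# resetting the same seed reproduces exactly the same numbers
set.seed(42)
again <- replicate(1000, mean(rnorm(10)))
identical(sample_means, again)

# simplify = FALSE keeps the individual results in a list instead of a matrix
draws <- replicate(3, rnorm(2), simplify = FALSE)
```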
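8. The mapply() function
mapply() is a multivariate version of sapply(). The ... arguments can take several data objects; mapply() applies FUN to the first elements of all of them, then to the second elements, and so on. The arguments should have the same length, or lengths that are integer multiples of one another (shorter ones are recycled). The return value is a vector or a matrix, depending on whether FUN returns one value or several.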
mapply(sum, list(a=1,b=2,c=3), list(a=10,b=20,d=30))
a b c
11 22 33
mapply(function(x,y) x^y, c(1:5), c(1:5))
[1] 1 4 27 256 3125
mapply(function(x,y) c(x+y, x^y), c(1:5), c(1:5))
[,1] [,2] [,3] [,4] [,5]
[1,] 2 4 6 8 10
[2,] 1 4 27 256 3125
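The recycling rule and two further arguments of mapply(), MoreArgs and the fallback to a list, can be sketched as follows. The names rec, reps and pw are illustrative.

```r
# lengths 4 and 2: the shorter argument is recycled as 1, 2, 1, 2
rec <- mapply(function(x, y) x + y, 1:4, 1:2)
rec

# results of different lengths cannot be simplified, so a list is returned
reps <- mapply(rep, 1:3, 3:1)
lengths(reps)

# MoreArgs passes arguments that stay the same in every call
pw <- mapply(function(x, y, base) log(x + y, base = base),
             1:3, 4:6, MoreArgs = list(base = 2))
```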
9. The rapply() function
Description: “rapply is a recursive version of lapply.”
rapply() is a recursive version of lapply(). It traverses a list; if an element is itself a list, it keeps descending into it. For every non-list element whose class is one of those given in the classes argument, it calls FUN. classes = "ANY" matches all classes.
# let's start with our usual simple list example
l <- list(a = 1:10, b = 11:20)
# log2 of each value in the list
rapply(l, log2)
a1 a2 a3 a4 a5 a6 a7 a8
0.000000 1.000000 1.584963 2.000000 2.321928 2.584963 2.807355 3.000000
a9 a10 b1 b2 b3 b4 b5 b6
3.169925 3.321928 3.459432 3.584963 3.700440 3.807355 3.906891 4.000000
b7 b8 b9 b10
4.087463 4.169925 4.247928 4.321928
# log2 of each value in each list
rapply(l, log2, how = "list")
$a
[1] 0.000000 1.000000 1.584963 2.000000 2.321928 2.584963 2.807355 3.000000
[9] 3.169925 3.321928
$b
[1] 3.459432 3.584963 3.700440 3.807355 3.906891 4.000000 4.087463 4.169925
[9] 4.247928 4.321928
# what if the function is the mean?
rapply(l, mean)
a b
5.5 15.5
rapply(l, mean, how = "list")
$a
[1] 5.5
$b
[1] 15.5
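The examples above use a flat list, so the recursion and the classes argument are not visible. Here is a minimal sketch with a made-up nested list: only numeric leaves are doubled, and how = "replace" keeps the structure and leaves all other elements untouched.

```r
# hypothetical nested list with numeric and character leaves
nested <- list(a = 1:3, b = list(c = c(2.5, 3.5), d = "text"))

out <- rapply(nested, function(x) x * 2,
              classes = c("integer", "numeric"),  # only touch numeric leaves
              how = "replace")                    # keep the nesting, copy the rest

out$a      # 2 4 6
out$b$c    # 5 7
out$b$d    # "text", unchanged
```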
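10. The tapply() function
tapply(X, INDEX, FUN = NULL, ..., default = NA, simplify = TRUE)
tapply() groups the elements of X by the values in INDEX: elements of X whose positions share the same INDEX value form one group, and FUN is applied to each group. This is similar to a "group by INDEX" operation. If FUN returns a single value, tapply() returns a vector (strictly, an array); if FUN returns several values, tapply() returns a list. The length of the vector or list equals the number of distinct values in INDEX.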
> v <- c(1:5)
> ind <- c('a','a','a','b','b')
> tapply(v, ind) # without FUN: returns the group index of each element, grouped by ind
[1] 1 1 1 2 2
> tapply(v, ind, sum) # with sum: the sum is computed within each group
a b
6 9
> tapply(v, ind, fivenum) # five-number summary of each group
$a
[1] 1.0 1.5 2.0 2.5 3.0
$b
[1] 4.0 4.0 4.5 5.0 5.0
> m <- matrix(c(1:10), nrow=2)
> m
[,1] [,2] [,3] [,4] [,5]
[1,] 1 3 5 7 9
[2,] 2 4 6 8 10
> ind <- matrix(c(rep(1,5), rep(2,5)), nrow=2)
> ind
[,1] [,2] [,3] [,4] [,5]
[1,] 1 1 1 2 2
[2,] 1 1 2 2 2
> tapply(m, ind) # group index of each element of matrix m according to ind
[1] 1 1 1 1 1 2 2 2 2 2
> tapply(m, ind, mean)
1 2
3 8
> tapply(m, ind, fivenum)
$`1`
[1] 1 2 3 4 5
$`2`
[1] 6 7 8 9 10
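On real data, INDEX is usually a factor column, and a list of factors gives a cross-tabulated result. A minimal sketch on the built-in iris and mtcars data sets; the names sl and tab are illustrative.

```r
# one grouping factor: mean sepal length per species
sl <- tapply(iris$Sepal.Length, iris$Species, mean)
sl

# two grouping factors: mean mpg for each combination of cyl and am,
# returned as a table with one row per cyl value and one column per am value
tab <- tapply(mtcars$mpg, list(cyl = mtcars$cyl, am = mtcars$am), mean)
dim(tab)
```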