這是英文版的4杂拨、5章節(jié)
4 Workflow: basics
-
括號(hào)好東西檐迟,能幫你在賦值的時(shí)候同時(shí)打印變量补胚,這點(diǎn)在Rmarkdown的時(shí)候也很有幫助
y <- seq(1, 10, length.out = 5) y # > [1] 1.00 3.25 5.50 7.75 10.00 (y <- seq(1, 10, length.out = 5)) #> [1] 1.00 3.25 5.50 7.75 10.00
-
快捷鍵 Alt + Shift + K. 會(huì)幫你把所有R studio的快捷鍵呈現(xiàn)出來
有一點(diǎn)要注意的是,R studio的一些快捷鍵在輸入法狀態(tài)下是不能用的
5.1 Introduction
-
這章講了5個(gè)函數(shù)
- Pick observations by their values (
filter()
).
- Pick observations by their values (
-
Reorder the rows (
arrange()
).- Pick variables by their names (
select()
).
- Pick variables by their names (
-
Create new variables with functions of existing variables (
mutate()
).- Collapse many values down to a single summary (
summarise()
).
- Collapse many values down to a single summary (
-
All verbs work similarly:
The first argument is a data frame.
-
The subsequent arguments describe what to do with the data frame, using the variable names (without quotes).
# tab可以自動(dòng)打出變量名(比如這里的year)的追迟,這點(diǎn)很方便溶其,但似乎只有在這種情況下才可以 flights %>% select(., year) # 這種似乎就不行 select(flights, year) # 還有就是記住變量名是沒有引號(hào)的 filter(flights, month == 1, day == 1)
The result is a new data frame.
5個(gè)函數(shù)可以和group_by 函數(shù)聯(lián)用,which changes the scope of each function from operating on the entire dataset to operating on it group-by-group
但有時(shí)候得記得ungroup
5.2 Filter rows with filter()
-
浮點(diǎn)數(shù)的問題
sqrt(2) ^ 2 == 2 #> [1] FALSE 1 / 49 * 49 == 1 #> [1] FALSE
這是因?yàn)橛?jì)算機(jī)是用有限精度來處理這些問題的敦间。碰到這些問題的時(shí)候瓶逃,考慮使用near函數(shù)
near(sqrt(2) ^ 2, 2)
#> [1] TRUE
near(1 / 49 * 49, 1)
#> [1] TRUE
comparison operators(比較運(yùn)算符) 以及 Logical operators(邏輯運(yùn)算符) 再次記住這兩個(gè)英文束铭,方便你有問題查詢google
-
這個(gè)圖片挺好的
image -
在執(zhí)行 | 的時(shí)候有一個(gè)有意思的事情
# 我們會(huì)看到下面的輸出結(jié)果是month==1的時(shí)候,而不是我們預(yù)期的是11或者12 > head(filter(flights, month == (11 | 12)), n = 3) # A tibble: 3 x 19 year month day dep_time sched_dep_time dep_delay arr_time <int> <int> <int> <int> <int> <dbl> <int> 1 2013 1 1 517 515 2 830 2 2013 1 1 533 529 4 850 3 2013 1 1 542 540 2 923 # ... with 12 more variables: sched_arr_time <int>, # arr_delay <dbl>, carrier <chr>, flight <int>, # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, # distance <dbl>, hour <dbl>, minute <dbl>, # time_hour <dttm> # 是因?yàn)槭紫?> 11 | 12 [1] TRUE # 然后TRUE在數(shù)字環(huán)境的時(shí)候厢绝,就會(huì)變成1 # 那么結(jié)果就變成了 month == 1 > TRUE == 1 [1] TRUE > head(filter(flights, month == 1), n = 3) # A tibble: 3 x 19 year month day dep_time sched_dep_time dep_delay arr_time <int> <int> <int> <int> <int> <dbl> <int> 1 2013 1 1 517 515 2 830 2 2013 1 1 533 529 4 850 3 2013 1 1 542 540 2 923 # ... with 12 more variables: sched_arr_time <int>, # arr_delay <dbl>, carrier <chr>, flight <int>, # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, # distance <dbl>, hour <dbl>, minute <dbl>, # time_hour <dttm>
-
x %in% y 是挺好的函數(shù)契沫。This will select every row where x is one of the values in y .
nov_dec <- filter(flights, month %in% c(11, 12))
這會(huì)讓你的函數(shù)會(huì)更加的壓縮,比如這里有4,11,12的話昔汉,相比于 == 而言會(huì)更好寫一點(diǎn)
filter(flights, month %in% c(4, 11, 12)) filter(flights, month == 4 | month == 11 | month == 12)
-
remembering De Morgan’s law:
!(x & y)
is the same as!x | !y
, and!(x | y)
is the same as!x & !y
.# 這里原書有錯(cuò)誤:weren’t delayed (on arrival or departure) by more than two hours # 原書是或懈万,但這里表述的是且 filter(flights, !(arr_delay > 120 | dep_delay > 120)) filter(flights, arr_delay <= 120, dep_delay <= 120)
&& 和 || 我們在會(huì)后面遇到,記住這里不要在這里用靶病!我們應(yīng)該用的是 | 和 &
-
Missing Value会通,即 NA (“not availables”) 。我能想到的NA場景就是在RNA-Seq這類的p-value里面的結(jié)果遇到娄周。
# NA是具有傳染性的 # 因?yàn)镹A代表著有這個(gè)數(shù)據(jù)涕侈,但你不知道是什么,所以對于NA的一切操作結(jié)果煤辨,你都只能是不知道裳涛,即NA # 當(dāng)然,也有例子众辨,見下面 NA > 5 #> [1] NA 10 == NA #> [1] NA NA + 10 #> [1] NA NA / 2 #> [1] NA ------------------------------------------------ NA == NA #> [1] NA # It’s easiest to understand why this is true with a bit more context: # Let x be Mary's age. We don't know how old she is. x <- NA # Let y be John's age. We don't know how old he is. y <- NA # Are John and Mary the same age? x == y #> [1] NA # We don't know! -------------------------------------------------- is.na(x) #> [1] TRUE # filter只會(huì)包含condition是TRUE的调违,并不會(huì)保留 NA 的 df <- tibble(x = c(1, NA, 3)) filter(df, x > 1) #> # A tibble: 1 x 1 #> x #> <dbl> #> 1 3 # 想要NA結(jié)果的話得可以自己搞一個(gè) filter(df, is.na(x) | x > 1) #> # A tibble: 2 x 1 #> x #> <dbl> #> 1 NA #> 2 3
-
Exercise 5.2.2
Another useful dplyr filtering helper is
between()
. What does it do? Can you use it to simplify the code needed to answer the previous challenges?關(guān)于between的
Description
This is a shortcut for x >= left & x <= right, implemented efficiently in C++ for local values, and translated to the appropriate SQL for remote tables.Usage
between(x, left, right)# 之前我們寫 filter(flights, month %in% 7:9) # 現(xiàn)在我們還可以用between filter(flights, between(month, 7,9)) # 其實(shí)我感覺filter就是把下面步驟整合了 flights[between(flights$month, 7, 9),] # 對于between,我們還可以驗(yàn)證標(biāo)準(zhǔn)正態(tài)分布的-1到1占據(jù)了68.3%的面積 > x <- rnorm(1e7) > length(x[between(x, -1, 1)]) / 1e7 [1] 0.6827031
-
Exercise 5.2.3
How many flights have a missing
dep_time
? What other variables are missing? What might these rows represent?filter(flights, is.na(dep_time)) #> # A tibble: 8,255 x 19 #> year month day dep_time sched_dep_time dep_delay arr_time #> <int> <int> <int> <int> <int> <dbl> <int> #> 1 2013 1 1 NA 1630 NA NA #> 2 2013 1 1 NA 1935 NA NA #> 3 2013 1 1 NA 1500 NA NA #> 4 2013 1 1 NA 600 NA NA #> 5 2013 1 2 NA 1540 NA NA #> 6 2013 1 2 NA 1620 NA NA #> # … with 8,249 more rows, and 12 more variables: sched_arr_time <int>, #> # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>, #> # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, #> # minute <dbl>, time_hour <dttm>
-
Exercise 5.2.4
Why is
NA ^ 0
not missing? Why isNA | TRUE
not missing? Why isFALSE & NA
not missing? Can you figure out the general rule? (NA * 0
is a tricky counter example!)
# 任何數(shù)的0次方都是1
# 這個(gè)特性對于scale的時(shí)候很有用泻轰,因?yàn)橛袝r(shí)候你對于所有數(shù)都是均等進(jìn)行scale的話技肩,會(huì)由于方差是0(因?yàn)榉讲钍欠帜福┒鵁o法返回值,這時(shí)候你就可以 x - mean(x) / (sd(x) ^ 0) 來解決這個(gè)問題了
> NA ^ 0
[1] 1
# 因?yàn)?# anything and FALSE is always FALSE.
# anything or TRUE is always TRUE
> NA | TRUE
[1] TRUE
> FALSE & NA
[1] FALSE
# 而這里就不是了浮声,我們不確定NA是什么虚婿,而結(jié)果又是會(huì)根據(jù)NA值的不同而變化的,所以返回值就是NA了
> NA | FALSE
[1] NA
> NA & TRUE
[1] NA
# 關(guān)于這個(gè)問題泳挥,我更喜歡Quaro的回答
# https://www.quora.com/In-R-why-is-NA*0-not-equal-to-0
# 即NA可以代表任何值然痊,可以代表0也可以代表NaN。
# 而 x * 0 == 0 這一特性只在值是有限的時(shí)候屉符,
# 而在無限的時(shí)候結(jié)果則是NaN剧浸,即無意義數(shù)
# 所以這里等于用NA代表了兩種結(jié)果
> NA * 0
[1] NA
> Inf * 0
[1] NaN
NaN 和 Null 要區(qū)分, NaN代表無意義矗钟,即 not a number唆香。而 Null 代表的是一種特殊的對象,表示函數(shù)沒有被賦予任何內(nèi)容吨艇。
5.3 Arrange rows with arrange()
-
arrange排序的時(shí)候,缺失值在最后
df <- tibble(x = c(5, 2, NA)) arrange(df, x) #> # A tibble: 3 x 1 #> x #> <dbl> #> 1 2 #> 2 5 #> 3 NA arrange(df, desc(x)) #> # A tibble: 3 x 1 #> x #> <dbl> #> 1 5 #> 2 2 #> 3 NA
-
Exercise 5.3.1
How could you use
arrange()
to sort all missing values to the start? (Hint: useis.na()
).> (df <- tibble(x = c(5, 2, 8, NA, NA, NA), + y = c(4, 3, 1, 4, 3, 8))) # A tibble: 6 x 2 x y <dbl> <dbl> 1 5 4 2 2 3 3 8 1 4 NA 4 5 NA 3 6 NA 8 # 默認(rèn)排序總是會(huì)把NA排在最后 # 然后我們還應(yīng)該注意的一點(diǎn)是NA對應(yīng)的4组贺、3凸舵、8其實(shí)是不會(huì)排序的 #(雖然理論上來說如果x列相同,y列會(huì)再一次排序) # 我們看起來NA和NA是“一樣的”失尖,但前面的知識(shí)告訴我們啊奄,NA == NA的結(jié)果是NA,即不知道雹仿,所以這里并沒有二次排序 > arrange(df, x) # A tibble: 6 x 2 x y <dbl> <dbl> 1 2 3 2 5 4 3 8 1 4 NA 4 5 NA 3 6 NA 8 # 我看了solution之后,明白了 # is.na返回的是TRUE或者FALSE整以,而TRUE在與FALSE比較的時(shí)候胧辽,TRUE > FALSE # 然后再按desc(降序排列,從大到小排序)的話公黑,就是 NA 的在前面了 # 在is.na排完之后邑商,我們就不再排序了,還是按照原來的順序 5 -> 2 -> 8 > arrange(df, desc(is.na(x))) # A tibble: 6 x 2 x y <dbl> <dbl> 1 NA 4 2 NA 3 3 NA 8 4 5 4 5 2 3 6 8 1 # 如果你想在TRUE和FALSE排完之后凡蚜,再做一波排序的話人断,可以在加上x # 那就是先排is.na(x)的結(jié)果,is.na(x)排完朝蜘,再排一次x > arrange(df, desc(is.na(x)),x) # A tibble: 6 x 2 x y <dbl> <dbl> 1 NA 4 2 NA 3 3 NA 8 4 2 3 5 5 4 6 8 1
-
Exercise 5.3.3
Sort flights to find the fastest flights.
# 還可以加上數(shù)學(xué)表達(dá)式的結(jié)果排序 flights %>% arrange(., desc(distance / air_time))
5.4 Select columns with select()
-
select挑選
# 你既可以挑選某幾列 # 也可以去掉某幾列 select(flights, -(year:day)) #> # A tibble: 336,776 x 16 #> dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier #> <int> <int> <dbl> <int> <int> <dbl> <chr> #> 1 517 515 2 830 819 11 UA #> 2 533 529 4 850 830 20 UA #> 3 542 540 2 923 850 33 AA #> 4 544 545 -1 1004 1022 -18 B6 #> 5 554 600 -6 812 837 -25 DL #> 6 554 558 -4 740 728 12 UA #> # … with 3.368e+05 more rows, and 9 more variables: flight <int>, #> # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, #> # hour <dbl>, minute <dbl>, time_hour <dttm>
-
There are a number of helper functions you can use within
select()
:-
starts_with("abc")
: matches names that begin with “abc”.> flights %>% + select(., - starts_with("ye")) # A tibble: 336,776 x 18 month day dep_time sched_dep_time dep_delay arr_time <int> <int> <int> <int> <dbl> <int> 1 1 1 517 515 2 830 2 1 1 533 529 4 850 3 1 1 542 540 2 923 4 1 1 544 545 -1 1004 5 1 1 554 600 -6 812 6 1 1 554 558 -4 740 7 1 1 555 600 -5 913 8 1 1 557 600 -3 709 9 1 1 557 600 -3 838 10 1 1 558 600 -2 753 # ... with 336,766 more rows, and 12 more variables: # sched_arr_time <int>, arr_delay <dbl>, carrier <chr>, # flight <int>, tailnum <chr>, origin <chr>, dest <chr>, # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, # time_hour <dttm>
ends_with("xyz")
: matches names that end with “xyz”.contains("ijk")
: matches names that contain “ijk”.matches("(.)\\1")
: selects variables that match a regular expression. This one matches any variables that contain repeated characters. You’ll learn more about regular expressions in strings.num_range("x", 1:3)
: matchesx1
,x2
andx3
.
See
?select
for more details. -
-
rename 和 select的區(qū)別
# 可以看到rename好用多了 > rename(as_tibble(iris), petal_length = Petal.Length) # A tibble: 150 x 5 Sepal.Length Sepal.Width petal_length Petal.Width Species <dbl> <dbl> <dbl> <dbl> <fct> 1 5.1 3.5 1.4 0.2 setosa 2 4.9 3 1.4 0.2 setosa 3 4.7 3.2 1.3 0.2 setosa 4 4.6 3.1 1.5 0.2 setosa 5 5 3.6 1.4 0.2 setosa 6 5.4 3.9 1.7 0.4 setosa 7 4.6 3.4 1.4 0.3 setosa 8 5 3.4 1.5 0.2 setosa 9 4.4 2.9 1.4 0.2 setosa 10 4.9 3.1 1.5 0.1 setosa # ... with 140 more rows # select只會(huì)保留你選的 > select(as_tibble(iris), petal_length = Petal.Length) # A tibble: 150 x 1 petal_length <dbl> 1 1.4 2 1.4 3 1.3 4 1.5 5 1.4 6 1.7 7 1.4 8 1.5 9 1.4 10 1.5 # ... with 140 more rows
-
everything也是很好用的
# 可以把某幾列調(diào)到前面 select(flights, time_hour, air_time, everything()) #> # A tibble: 336,776 x 19 #> time_hour air_time year month day dep_time sched_dep_time #> <dttm> <dbl> <int> <int> <int> <int> <int> #> 1 2013-01-01 05:00:00 227 2013 1 1 517 515 #> 2 2013-01-01 05:00:00 227 2013 1 1 533 529 #> 3 2013-01-01 05:00:00 160 2013 1 1 542 540 #> 4 2013-01-01 05:00:00 183 2013 1 1 544 545 #> 5 2013-01-01 06:00:00 116 2013 1 1 554 600 #> 6 2013-01-01 05:00:00 150 2013 1 1 554 558 #> # … with 3.368e+05 more rows, and 12 more variables: dep_delay <dbl>, #> # arr_time <int>, sched_arr_time <int>, arr_delay <dbl>, carrier <chr>, #> # flight <int>, tailnum <chr>, origin <chr>, dest <chr>, distance <dbl>, #> # hour <dbl>, minute <dbl>
-
Exercise 5.4.1
Brainstorm as many ways as possible to select
dep_time
,dep_delay
,arr_time
, andarr_delay
from flights.來自solution
# 關(guān)于變量名加不加引號(hào)這點(diǎn)恶迈,我一直沒怎么搞清楚 # 不過后面我搞清楚了一點(diǎn)點(diǎn) select(flights, dep_time, dep_delay, arr_time, arr_delay) select(flights, "dep_time", "dep_delay", "arr_time", "arr_delay") select(flights, 4, 6, 7, 9) variables <- c("dep_time", "dep_delay", "arr_time", "arr_delay") select(flights, one_of(c("dep_time", "dep_delay", "arr_time", "arr_delay"))) select(flights, one_of(variables)) select(flights, starts_with("dep_"), starts_with("arr_"))
-
Exercise 5.4.2
What happens if you include the name of a variable multiple times in a
select()
call?# 這點(diǎn)對于我們的everything特別有用 # 即select變量名冗余會(huì)只保留一個(gè) select(flights, year, month, day, year, year) #> # A tibble: 336,776 x 3 #> year month day #> <int> <int> <int> #> 1 2013 1 1 #> 2 2013 1 1 #> 3 2013 1 1 #> 4 2013 1 1 #> 5 2013 1 1 #> 6 2013 1 1 #> # … with 3.368e+05 more rows # arr_delay這個(gè)變量是冗余的 select(flights, arr_delay, everything()) #> # A tibble: 336,776 x 19 #> arr_delay year month day dep_time sched_dep_time dep_delay arr_time #> <dbl> <int> <int> <int> <int> <int> <dbl> <int> #> 1 11 2013 1 1 517 515 2 830 #> 2 20 2013 1 1 533 529 4 850 #> 3 33 2013 1 1 542 540 2 923 #> 4 -18 2013 1 1 544 545 -1 1004 #> 5 -25 2013 1 1 554 600 -6 812 #> 6 12 2013 1 1 554 558 -4 740 #> # … with 3.368e+05 more rows, and 11 more variables: sched_arr_time <int>, #> # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>, #> # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, #> # time_hour <dttm>
-
Exercise 5.4.3
What does the
one_of()
function do? Why might it be helpful in conjunction with this vector?這個(gè)solution部分的解答非常值得好好看
vars <- c("year", "month", "day", "dep_delay", "arr_delay") select(flights, one_of(vars)) #> # A tibble: 336,776 x 5 #> year month day dep_delay arr_delay #> <int> <int> <int> <dbl> <dbl> #> 1 2013 1 1 2 11 #> 2 2013 1 1 4 20 #> 3 2013 1 1 2 33 #> 4 2013 1 1 -1 -18 #> 5 2013 1 1 -6 -25 #> 6 2013 1 1 -4 12 #> # … with 3.368e+05 more rows select(flights, vars) #> # A tibble: 336,776 x 5 #> year month day dep_delay arr_delay #> <int> <int> <int> <dbl> <dbl> #> 1 2013 1 1 2 11 #> 2 2013 1 1 4 20 #> 3 2013 1 1 2 33 #> 4 2013 1 1 -1 -18 #> 5 2013 1 1 -6 -25 #> 6 2013 1 1 -4 12 #> # … with 3.368e+05 more rows
如果 vars 是flight里面的變量,那么就會(huì)返回名字是vars的變量列谱醇,如果不是flight里面的變量暇仲,就會(huì)查找 vars的值,對應(yīng)的變量列
# 舉個(gè)例子 > year <- "month" > flights %>% + select(., year) %>% + head(., n = 2) # A tibble: 2 x 1 year <int> 1 2013 2 2013 > year_another <- "month" > flights %>% + select(., year_another) %>% + head(., n = 2) # A tibble: 2 x 1 month <int> 1 1 2 1 > year_second <- "month_another" > flights %>% + select(., year_second) %>% + head(., n = 2) Error: Unknown column `month_another` Run `rlang::last_error()` to see where the error occurred.
如果要消除這種“混淆”副渴,利用 !!!
> flights %>% + select(., !!!year) %>% + head(., n = 2) # A tibble: 2 x 1 month <int> 1 1 2 1
This behavior, which is used by many tidyverse functions, is an example of what is called non-standard evaluation (NSE) in R. See the dplyr vignette, Programming with dplyr, for more information on this topic.
-
Exercise 5.4.4
Does the result of running the following code surprise you? How do the select helpers deal with case by default? How can you change that default?
select(flights, contains("TIME")) #> # A tibble: 336,776 x 6 #> dep_time sched_dep_time arr_time sched_arr_time air_time #> <int> <int> <int> <int> <dbl> #> 1 517 515 830 819 227 #> 2 533 529 850 830 227 #> 3 542 540 923 850 160 #> 4 544 545 1004 1022 183 #> 5 554 600 812 837 116 #> 6 554 558 740 728 150 #> # … with 3.368e+05 more rows, and 1 more variable: time_hour <dttm>
這是因?yàn)榇笮懖幻舾械膯栴}
select(flights, contains("TIME", ignore.case = FALSE)) #> # A tibble: 336,776 x 0
5.5 Add new variables with mutate()
-
我喜歡Mutate這個(gè)特性
Note that you can refer to columns that you’ve just created:
flights_sml <- select(flights, year:day, ends_with("delay"), distance, air_time ) mutate(flights_sml, gain = dep_delay - arr_delay, speed = distance / air_time * 60 ) #> # A tibble: 336,776 x 9 #> year month day dep_delay arr_delay distance air_time gain speed #> <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> #> 1 2013 1 1 2 11 1400 227 -9 370. #> 2 2013 1 1 4 20 1416 227 -16 374. #> 3 2013 1 1 2 33 1089 160 -31 408. #> 4 2013 1 1 -1 -18 1576 183 17 517. #> 5 2013 1 1 -6 -25 762 116 19 394. #> 6 2013 1 1 -4 12 719 150 -16 288. #> # … with 3.368e+05 more rows mutate(flights_sml, gain = dep_delay - arr_delay, hours = air_time / 60, gain_per_hour = gain / hours ) #> # A tibble: 336,776 x 10 #> year month day dep_delay arr_delay distance air_time gain hours #> <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> #> 1 2013 1 1 2 11 1400 227 -9 3.78 #> 2 2013 1 1 4 20 1416 227 -16 3.78 #> 3 2013 1 1 2 33 1089 160 -31 2.67 #> 4 2013 1 1 -1 -18 1576 183 17 3.05 #> 5 2013 1 1 -6 -25 762 116 19 1.93 #> 6 2013 1 1 -4 12 719 150 -16 2.5 #> # … with 3.368e+05 more rows, and 1 more variable: gain_per_hour <dbl>
-
如果你只想保留新產(chǎn)生的變量
transmute(flights, gain = dep_delay - arr_delay, hours = air_time / 60, gain_per_hour = gain / hours ) #> # A tibble: 336,776 x 3 #> gain hours gain_per_hour #> <dbl> <dbl> <dbl> #> 1 -9 3.78 -2.38 #> 2 -16 3.78 -4.23 #> 3 -31 2.67 -11.6 #> 4 17 3.05 5.57 #> 5 19 1.93 9.83 #> 6 -16 2.5 -6.4 #> # … with 3.368e+05 more rows
-
有許多函數(shù)你都可以和mutate聯(lián)用奈附,從而創(chuàng)造新的變量。這些函數(shù)的關(guān)鍵特性就是向量化煮剧,即你輸入一個(gè)向量斥滤,輸出也是一個(gè)向量,輸入輸出向量里面包含的值是對應(yīng)的勉盅。
-
Arithmetic operators:
+
,-
,*
,/
,^
. These are all vectorised, using the so called “recycling rules”. If one parameter is shorter than the other, it will be automatically extended to be the same length. This is most useful when one of the arguments is a single number:air_time / 60
,hours * 60 + minute
, etc.Arithmetic operators are also useful in conjunction with the aggregate functions you’ll learn about later. For example,
x / sum(x)
calculates the proportion of a total, andy - mean(y)
computes the difference from the mean. -
Modular arithmetic(所謂的模運(yùn)算):
%/%
(integer division) and%%
(remainder), wherex == y * (x %/% y) + (x %% y)
. Modular arithmetic is a handy tool because it allows you to break integers up into pieces. For example, in the flights dataset, you can computehour
andminute
fromdep_time
with:transmute(flights, dep_time, hour = dep_time %/% 100, minute = dep_time %% 100 ) #> # A tibble: 336,776 x 3 #> dep_time hour minute #> <int> <dbl> <dbl> #> 1 517 5 17 #> 2 533 5 33 #> 3 542 5 42 #> 4 544 5 44 #> 5 554 5 54 #> 6 554 5 54 #> # … with 3.368e+05 more rows
-
Logs:
log()
,log2()
,log10()
. Logarithms are an incredibly useful transformation for dealing with data that ranges across multiple orders of magnitude. They also convert multiplicative relationships to additive(累乘關(guān)系變成累加關(guān)系佑颇,最大似然估計(jì)那邊會(huì)用到吧), a feature we’ll come back to in modelling.All else being equal, I recommend using
log2()
because it’s easy to interpret: a difference of 1 on the log scale corresponds to doubling on the original scale and a difference of -1 corresponds to halving.(對數(shù)標(biāo)度下的數(shù)值增加 1 個(gè)單位,意味著初始數(shù)值加倍草娜;減少 1 個(gè)單位漩符,則意味著初始數(shù)值減半。常見的就是log2FoldChange了) -
Offsets:
lead()
andlag()
allow you to refer to leading or lagging values.(感覺是整體數(shù)據(jù)往前或者往后挪一格) This allows you to compute running differences (e.g.x - lag(x)
) or find when values change (x != lag(x)
). They are most useful in conjunction withgroup_by()
, which you’ll learn about shortly.(x <- 1:10) #> [1] 1 2 3 4 5 6 7 8 9 10 lag(x) #> [1] NA 1 2 3 4 5 6 7 8 9 lead(x) #> [1] 2 3 4 5 6 7 8 9 10 NA
-
Cumulative and rolling aggregates: R provides functions for running sums, products, mins and maxes:
cumsum()
,cumprod()
,cummin()
,cummax()
; and dplyr providescummean()
for cumulative means. If you need rolling aggregates (i.e. a sum computed over a rolling window), try the RcppRoll package.x #> [1] 1 2 3 4 5 6 7 8 9 10 # 累積求和 cumsum(x) #> [1] 1 3 6 10 15 21 28 36 45 55 # 累積求平均 cummean(x) #> [1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5
RcppRoll包的使用(我覺得RcppRoll包對于一些基因組的數(shù)據(jù)應(yīng)該很有用)
install.packages("RcppRoll") library(RcppRoll) # 看參數(shù)說明 # n 滾動(dòng)求和窗口的大小 # by 表示每次窗口移動(dòng)的距離 # 也就是說如果 n = by的話驱还,就是我們常見的bw壓縮或者分bin求值的基因組了吧 (x <- 1:10) [1] 1 2 3 4 5 6 7 8 9 10 > roll_sum(x, n = 3, by = 3) [1] 6 15 24 > roll_sum(x, n = 3, by = 2) [1] 6 12 18 24 > roll_sum(x, n = 3, by = 1) [1] 6 9 12 15 18 21 24 27
又讓我想起了Sliding window 和 fixed windows嗜暴。
我估計(jì)是sliding windows會(huì)有bin(這里的bin經(jīng)常和window是混用的)(bin就是RcppRoll的n)和step(Step就是RcppRoll的by)兩個(gè)參數(shù)凸克。但fixed windows只有一個(gè)參數(shù),即bin闷沥,應(yīng)該就沒有step了萎战,或者說step等于window。或者說 sliding windows 又會(huì)分overlap or be disjoint舆逃。overlap就是我們step < bin蚂维,而 disjoint 就是 step >= bin 了。
參考
-
Logical comparisons,
<
,<=
,>
,>=
,!=
, and==
, which you learned about earlier. If you’re doing a complex sequence of logical operations it’s often a good idea to store the interim values in new variables so you can check that each step is working as expected.-
Ranking: there are a number of ranking functions, but you should start with
min_rank()
. It does the most usual type of ranking (e.g. 1st, 2nd, 2nd, 4th). The default gives smallest values the small ranks; usedesc(x)
to give the largest values the smallest ranks.y <- c(1, 2, 2, NA, 3, 4) min_rank(y) #> [1] 1 2 2 NA 4 5 min_rank(desc(y)) #> [1] 5 3 3 NA 2 1
If
min_rank()
doesn’t do what you need, look at the variantsrow_number()
,dense_rank()
,percent_rank()
,cume_dist()
,ntile()
. See their help pages for more details.row_number(y) # row_number(): equivalent to rank(ties.method = "first") #> [1] 1 2 3 NA 4 5 dense_rank(y) # dense_rank(): like min_rank(), but with no gaps between ranks #> [1] 1 2 2 NA 3 4 percent_rank(y) # 這是分位數(shù)(向量化版本) #> [1] 0.00 0.25 0.25 NA 0.75 1.00 cume_dist(y) # 這是累積分布函數(shù)(向量化版本) #> [1] 0.2 0.6 0.6 NA 0.8 1.0
Exercise 5.5.4的solution有一個(gè)解釋我覺得蠻好的
rankme <- tibble( x = c(10, 5, 1, 5, 5) ) rankme <- mutate(rankme, x_row_number = row_number(x), x_min_rank = min_rank(x), x_dense_rank = dense_rank(x) ) arrange(rankme, x) #> # A tibble: 5 x 4 #> x x_row_number x_min_rank x_dense_rank #> <dbl> <int> <int> <int> #> 1 1 1 1 1 #> 2 5 2 2 2 #> 3 5 3 2 2 #> 4 5 4 2 2 #> 5 10 5 5 3
-
Exercise 5.5.1
Currently
dep_time
andsched_dep_time
are convenient to look at, but hard to compute with because they’re not really continuous numbers. Convert them to a more convenient representation of number of minutes since midnight.看了solution才理解路狮,這題意思應(yīng)該是虫啥,現(xiàn)在 dep_time 里面的時(shí)間,表示很方便(517代表了5:17)奄妨, 但計(jì)算很麻煩(517-417是100涂籽,但代表的不是100min,而是60分鐘)砸抛。所以我們應(yīng)該進(jìn)行轉(zhuǎn)換评雌。轉(zhuǎn)換成距離午夜12點(diǎn)的時(shí)間
# 以1504(15:04)舉例 # 這里產(chǎn)生的904,就是距離午夜12點(diǎn)(24:00)904min 1504 %/% 100 * 60 + 1504 %% 100 #> [1] 904 # 然后有一個(gè)小問題就是直焙,距離午夜1440分鐘景东,恰好是午夜12點(diǎn),剛好一個(gè)輪回 # 那么應(yīng)該是距離午夜12點(diǎn) 0分鐘奔誓,而不是1440 # 所以最后是變成了 (1504 %/% 100 * 60 + 1504 %% 100) %% 1440 # 轉(zhuǎn)換的話 flights %>% select(., dep_time, sched_dep_time) %>% mutate(., dep_time_minus = (dep_time %/% 100 * 60 + dep_time %% 100) %% 1440, sched_dep_time_minus = (sched_dep_time %/% 100 * 60 + sched_dep_time %% 100) %% 1440) # A tibble: 336,776 x 4 dep_time sched_dep_time dep_time_minus sched_dep_time_minus <int> <int> <dbl> <dbl> 1 517 515 317 315 2 533 529 333 329 3 542 540 342 340 4 544 545 344 345 5 554 600 354 360 6 554 558 354 358 7 555 600 355 360 8 557 600 357 360 9 557 600 357 360 10 558 600 358 360 # ... with 336,766 more rows # 還可以寫成函數(shù) time2mins <- function(x) { (x %/% 100 * 60 + x %% 100) %% 1440 }
-
Exercise 5.5.6
What trigonometric functions does R provide?
這部分solution寫的特別詳細(xì)斤吐,大家有需要的自己去看就行
5.6 Grouped summaries with summarise()
-
It collapses a data frame to a single row
summarise()
is not terribly useful unless we pair it withgroup_by()
. This changes the unit of analysis from the complete dataset to individual groups. Then, when you use the dplyr verbs on a grouped data frame they’ll be automatically applied “by group”. For example, if we applied exactly the same code to a data frame grouped by date, we get the average delay per date:summarise(flights, delay = mean(dep_delay, na.rm = TRUE)) #> # A tibble: 1 x 1 #> delay #> <dbl> #> 1 12.6 # 記得NA.rm,有傳染性 flights %>% group_by(year, month, day) %>% summarise(mean = mean(dep_delay)) #> # A tibble: 365 x 4 #> # Groups: year, month [12] #> year month day mean #> <int> <int> <int> <dbl> #> 1 2013 1 1 NA #> 2 2013 1 2 NA #> 3 2013 1 3 NA #> 4 2013 1 4 NA #> 5 2013 1 5 NA #> 6 2013 1 6 NA #> # … with 359 more rows
我有時(shí)候感覺group就相當(dāng)于把觀測變得有層次了厨喂,比如上面就是先根據(jù)year劃一層曲初,然后根據(jù)month劃一層,最后根據(jù)day劃一層杯聚。
Whenever you do any aggregation, it’s always a good idea to include either a count (
n()
), or a count of non-missing values (sum(!is.na(x))
). That way you can check that you’re not drawing conclusions based on very small amounts of data.
-
Just using means, counts, and sum can get you a long way, but R provides many other useful summary functions:
-
Measures of location: we’ve used
mean(x)
, butmedian(x)
is also useful. The mean is the sum divided by the length; the median is a value where 50% ofx
is above it, and 50% is below it.It’s sometimes useful to combine aggregation with logical subsetting. We haven’t talked about this sort of subsetting yet, but you’ll learn more about it in subsetting.
not_cancelled %>% group_by(year, month, day) %>% summarise( avg_delay1 = mean(arr_delay), avg_delay2 = mean(arr_delay[arr_delay > 0]) # the average positive delay ) #> # A tibble: 365 x 5 #> # Groups: year, month [12] #> year month day avg_delay1 avg_delay2 #> <int> <int> <int> <dbl> <dbl> #> 1 2013 1 1 12.7 32.5 #> 2 2013 1 2 12.7 32.0 #> 3 2013 1 3 5.73 27.7 #> 4 2013 1 4 -1.93 28.3 #> 5 2013 1 5 -1.53 22.6 #> 6 2013 1 6 4.24 24.4 #> # … with 359 more rows
-
Measures of spread:
sd(x)
,IQR(x)
(IQR就是boxplot里面的那個(gè)上四分位點(diǎn) - 下四分位點(diǎn)),mad(x)
. The root mean squared deviation, or standard deviationsd(x)
, is the standard measure of spread. The interquartile rangeIQR(x)
and median absolute deviationmad(x)
are robust equivalents that may be more useful if you have outliers.# Why is distance to some destinations more variable than to others? not_cancelled %>% group_by(dest) %>% summarise(distance_sd = sd(distance)) %>% arrange(desc(distance_sd)) #> # A tibble: 104 x 2 #> dest distance_sd #> <chr> <dbl> #> 1 EGE 10.5 #> 2 SAN 10.4 #> 3 SFO 10.2 #> 4 HNL 10.0 #> 5 SEA 9.98 #> 6 LAS 9.91 #> # … with 98 more rows
-
Measures of rank:
min(x)
,quantile(x, 0.25)
,max(x)
. Quantiles are a generalisation of the median. For example,quantile(x, 0.25)
will find a value ofx
that is greater than 25% of the values, and less than the remaining 75%.# When do the first and last flights leave each day? not_cancelled %>% group_by(year, month, day) %>% summarise( first = min(dep_time), last = max(dep_time) ) #> # A tibble: 365 x 5 #> # Groups: year, month [12] #> year month day first last #> <int> <int> <int> <int> <int> #> 1 2013 1 1 517 2356 #> 2 2013 1 2 42 2354 #> 3 2013 1 3 32 2349 #> 4 2013 1 4 25 2358 #> 5 2013 1 5 14 2357 #> 6 2013 1 6 16 2355 #> # … with 359 more rows
-
Measures of position:
first(x)
,nth(x, 2)
,last(x)
. These work similarly tox[1]
,x[2]
, andx[length(x)]
but let you set a default value if that position does not exist (i.e. you’re trying to get the 3rd element from a group that only has two elements). For example, we can find the first and last departure for each day:not_cancelled %>% group_by(year, month, day) %>% summarise( first_dep = first(dep_time), last_dep = last(dep_time) ) #> # A tibble: 365 x 5 #> # Groups: year, month [12] #> year month day first_dep last_dep #> <int> <int> <int> <int> <int> #> 1 2013 1 1 517 2356 #> 2 2013 1 2 42 2354 #> 3 2013 1 3 32 2349 #> 4 2013 1 4 25 2358 #> 5 2013 1 5 14 2357 #> 6 2013 1 6 16 2355 #> # … with 359 more rows
These functions are complementary to filtering on ranks. Filtering gives you all variables, with each observation in a separate row:
not_cancelled %>% group_by(year, month, day) %>% mutate(r = min_rank(desc(dep_time))) %>% filter(r %in% range(r)) #> # A tibble: 770 x 20 #> # Groups: year, month, day [365] #> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time #> <int> <int> <int> <int> <int> <dbl> <int> <int> #> 1 2013 1 1 517 515 2 830 819 #> 2 2013 1 1 2356 2359 -3 425 437 #> 3 2013 1 2 42 2359 43 518 442 #> 4 2013 1 2 2354 2359 -5 413 437 #> 5 2013 1 3 32 2359 33 504 442 #> 6 2013 1 3 2349 2359 -10 434 445 #> # … with 764 more rows, and 12 more variables: arr_delay <dbl>, carrier <chr>, #> # flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, #> # distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>, r <int> # 上面的不太清楚 # 其實(shí)這就等于是把最大的dep_time和最小的dep_time同時(shí)輸出了 > not_cancelled %>% + group_by(year, month, day) %>% + select(., year:day, dep_time) %>% + mutate(r = min_rank(desc(dep_time))) %>% + filter(r %in% range(r)) # A tibble: 770 x 5 # Groups: year, month, day [365] year month day dep_time r <int> <int> <int> <int> <int> 1 2013 1 1 517 831 2 2013 1 1 2356 1 3 2013 1 2 42 928 4 2013 1 2 2354 1 5 2013 1 3 32 900 6 2013 1 3 2349 1 7 2013 1 4 25 908 8 2013 1 4 2358 1 9 2013 1 4 2358 1 10 2013 1 5 14 717 # ... with 760 more rows
-
Counts: You’ve seen
n()
, which takes no arguments, and returns the size of the current group. To count the number of non-missing values, usesum(!is.na(x))
. To count the number of distinct (unique) values, usen_distinct(x)
.(這函數(shù)可以唉,類似于Linux里面的uniq -c幌绍?)# Which destinations have the most carriers? not_cancelled %>% group_by(dest) %>% summarise(carriers = n_distinct(carrier)) %>% arrange(desc(carriers)) #> # A tibble: 104 x 2 #> dest carriers #> <chr> <int> #> 1 ATL 7 #> 2 BOS 7 #> 3 CLT 7 #> 4 ORD 7 #> 5 TPA 7 #> 6 AUS 6 #> # … with 98 more rows
Counts are so useful that dplyr provides a simple helper if all you want is a count:
# 你可以看到not_cancelled是沒有g(shù)roup的 > not_cancelled # A tibble: 327,346 x 19 # count自動(dòng)完成了group_by+n() not_cancelled %>% count(dest) #> # A tibble: 104 x 2 #> dest n #> <chr> <int> #> 1 ABQ 254 #> 2 ACK 264 #> 3 ALB 418 #> 4 ANC 8 #> 5 ATL 16837 #> 6 AUS 2411 #> # … with 98 more rows # 我其實(shí)感覺 n() 就類似于 length() 吧 # 來自solution的exercise 5.6.2 > not_cancelled %>% + group_by(dest) %>% + summarise(n = length(dest)) %>% + head(n = 3) # A tibble: 3 x 2 dest n <chr> <int> 1 ABQ 254 2 ACK 264 3 ALB 418 > not_cancelled %>% + group_by(dest) %>% + summarise(n = n()) %>% + head(n = 3) # A tibble: 3 x 2 dest n <chr> <int> 1 ABQ 254 2 ACK 264 3 ALB 418
You can optionally provide a weight variable. For example, you could use this to “count” (sum) the total number of miles a plane flew:
之前是數(shù)每個(gè)group下的個(gè)數(shù)颁褂,這里是對每個(gè)group下,你指定的varible進(jìn)行求和
# 數(shù)個(gè)數(shù) > not_cancelled %>% + count(tailnum) %>% + head(., n = 3) # A tibble: 3 x 2 tailnum n <chr> <int> 1 D942DN 4 2 N0EGMQ 352 3 N10156 145 > not_cancelled %>% + group_by(tailnum) %>% + summarise(n = n()) %>% + head(n = 3) # A tibble: 3 x 2 tailnum n <chr> <int> 1 D942DN 4 2 N0EGMQ 352 3 N10156 145 # 求和 > not_cancelled %>% + count(tailnum, wt = distance) %>% + head(., n = 3) # A tibble: 3 x 2 tailnum n <chr> <dbl> 1 D942DN 3418 2 N0EGMQ 239143 3 N10156 109664 > not_cancelled %>% + select(tailnum, distance) %>% + group_by(tailnum) %>% + summarise(n = sum(distance)) %>% + head(n = 3) # A tibble: 3 x 2 tailnum n <chr> <dbl> 1 D942 3418 2 N0EGMQ 239143 3 N10156 109664
Counts and proportions of logical values:
sum(x > 10)
,mean(y == 0)
. When used with numeric functions,TRUE
is converted to 1 andFALSE
to 0. This makessum()
andmean()
very useful:sum(x)
gives the number ofTRUE
s inx
, andmean(x)
gives the proportion.
這個(gè)操作是很好的傀广,舉個(gè)最簡單的例子
# 下面這個(gè)例子可以用來說明 # 正態(tài)分布的中位數(shù)是0 > set.seed(19960203) > mean(rnorm(1E6) > 0) [1] 0.500266 > set.seed(19960203) > sum(rnorm(1E6) > 0) / 1E6 [1] 0.500266
# How many flights left before 5am? (these usually indicate delayed # flights from the previous day) not_cancelled %>% group_by(year, month, day) %>% summarise(n_early = sum(dep_time < 500)) #> # A tibble: 365 x 4 #> # Groups: year, month [12] #> year month day n_early #> <int> <int> <int> <int> #> 1 2013 1 1 0 #> 2 2013 1 2 3 #> 3 2013 1 3 4 #> 4 2013 1 4 3 #> 5 2013 1 5 3 #> 6 2013 1 6 2 #> # … with 359 more rows # What proportion of flights are delayed by more than an hour? not_cancelled %>% group_by(year, month, day) %>% summarise(hour_prop = mean(arr_delay > 60)) #> # A tibble: 365 x 4 #> # Groups: year, month [12] #> year month day hour_prop #> <int> <int> <int> <dbl> #> 1 2013 1 1 0.0722 #> 2 2013 1 2 0.0851 #> 3 2013 1 3 0.0567 #> 4 2013 1 4 0.0396 #> 5 2013 1 5 0.0349 #> 6 2013 1 6 0.0470 #> # … with 359 more rows
-
-
有意思的功能:當(dāng)使用多個(gè)變量進(jìn)行分組時(shí)颁独,每次的摘要統(tǒng)計(jì)會(huì)用掉一個(gè)分組變量。這樣就可以輕松地對數(shù)據(jù)集進(jìn)行循序漸進(jìn)的分析:
daily <- group_by(flights, year, month, day) (per_day <- summarise(daily, flights = n())) #> # A tibble: 365 x 4 #> # Groups: year, month [12] #> year month day flights #> <int> <int> <int> <int> #> 1 2013 1 1 842 #> 2 2013 1 2 943 #> 3 2013 1 3 914 #> 4 2013 1 4 915 #> 5 2013 1 5 720 #> 6 2013 1 6 832 #> # … with 359 more rows (per_month <- summarise(per_day, flights = sum(flights))) #> # A tibble: 12 x 3 #> # Groups: year [1] #> year month flights #> <int> <int> <int> #> 1 2013 1 27004 #> 2 2013 2 24951 #> 3 2013 3 28834 #> 4 2013 4 28330 #> 5 2013 5 28796 #> 6 2013 6 28243 #> # … with 6 more rows (per_year <- summarise(per_month, flights = sum(flights))) #> # A tibble: 1 x 2 #> year flights #> <int> <int> #> 1 2013 336776
在循序漸進(jìn)地進(jìn)行摘要分析時(shí)伪冰,需要小心:使用求和與計(jì)數(shù)操作是沒問題的誓酒,但如果想要使用加權(quán)平均和方差的話,就要仔細(xì)考慮一下,在基于秩的統(tǒng)計(jì)數(shù)據(jù)(如中位數(shù))上是無法進(jìn)行這些操作的靠柑。換句話說寨辩,對分組求和的結(jié)果再求和就是對整體求和,但分組中位數(shù)的中位數(shù)可不是整體的中位數(shù)歼冰。(這話來自中文版)
我覺得意思就是
# 整體求和 和 把整體分割成組靡狞,然后分組求和的結(jié)果是一樣的
# 這個(gè)你可以通過自己列公式看出來
> sum(1:8)
[1] 36
> sum(sum(1:4),sum(5:6),sum(7:8))
[1] 36
# 但整體求中位數(shù) 和 把整體分割成組,然后分組求中位數(shù)隔嫡,再求中位數(shù)的結(jié)果是不一樣的
> median(1:8)
[1] 4.5
> median(median(1:4),median(5:6),median(7:8))
[1] 2.5
# 均值同理
> mean(1:8)
[1] 4.5
> mean(mean(1:4),mean(5:6),mean(7:8))
[1] 2.5
-
Exercise 5.6.2
Come up with another approach that will give you the same output as
not_cancelled %>% count(dest)
andnot_cancelled %>% count(tailnum, wt = distance)
(without usingcount()
).not_cancelled <- flights %>% filter(!is.na(dep_delay), !is.na(arr_delay)) > not_cancelled %>% + count(dest) %>% + head(n = 3) # A tibble: 3 x 2 dest n <chr> <int> 1 ABQ 254 2 ACK 264 3 ALB 418 > not_cancelled %>% + group_by(dest) %>% + summarise(n = length(dest)) %>% + head(n = 3) # A tibble: 3 x 2 dest n <chr> <int> 1 ABQ 254 2 ACK 264 3 ALB 418 -------------------------------------------------------------------------------------------------- # 其實(shí)group_by加summarise我覺得可以認(rèn)為是下面這個(gè)操作的多次循環(huán) # 以ABQ為例 > not_cancelled %>% + filter(., dest == "ABQ") %>% + nrow() [1] 254 ------------------------------------------------------------------------------------------------------ > not_cancelled %>% + group_by(dest) %>% + summarise(n = n()) %>% + head(n = 3) # A tibble: 3 x 2 dest n <chr> <int> 1 ABQ 254 2 ACK 264 3 ALB 418 # Another alternative to count() is to use the combination of the group_by() and tally() verbs. # In fact, count() is effectively a short-cut for group_by() followed by tally(). > not_cancelled %>% + group_by(dest) %>% + tally() %>% + head(n = 3) # A tibble: 3 x 2 dest n <chr> <int> 1 ABQ 254 2 ACK 264 3 ALB 418
> not_cancelled %>%
+ count(tailnum, wt = distance) %>%
+ head(n = 3)
# A tibble: 3 x 2
tailnum n
<chr> <dbl>
1 D942DN 3418
2 N0EGMQ 239143
3 N10156 109664
> not_cancelled %>%
+ group_by(tailnum) %>%
+ summarise(n = sum(distance)) %>%
+ head(n = 3)
# A tibble: 3 x 2
tailnum n
<chr> <dbl>
1 D942DN 3418
2 N0EGMQ 239143
3 N10156 109664
# 同樣舉個(gè)例子
> not_cancelled %>%
+ filter(., tailnum == "D942DN") %>%
+ pull(distance) %>%
+ sum()
[1] 3418
# Like the previous example, we can also use the combination group_by() and tally().
# Any arguments to tally() are summed.
> not_cancelled %>%
+ group_by(tailnum) %>%
+ tally(distance) %>%
+ head(n = 3)
# A tibble: 3 x 2
tailnum n
<chr> <dbl>
1 D942DN 3418
2 N0EGMQ 239143
3 N10156 109664
-
Exercise 5.6.6
What does the sort argument to
count()
do? When might you use it?> not_cancelled %>% + count(dest, sort = T) %>% + head(n = 3) # A tibble: 3 x 2 dest n <chr> <int> 1 ATL 16837 2 ORD 16566 3 LAX 16026 # 等價(jià)于 > not_cancelled %>% + count(dest) %>% + arrange(., desc(n)) %>% + head(n = 3) # A tibble: 3 x 2 dest n <chr> <int> 1 ATL 16837 2 ORD 16566 3 LAX 16026
5.7 Grouped mutates (and filters)
-
雖然與 summarize() 函數(shù)結(jié)合起來使用是最有效的甸怕,但分組也可以與 mutate() 和 filter()函數(shù)結(jié)合,以完成非常便捷的操作腮恩。(中文版這段話)
flights_sml <- select(flights, year:day, ends_with("delay"), distance, air_time ) # Find the worst members of each group: # 這里對arr_delay進(jìn)行排序梢杭,按從大到小排。然后再得到rank秸滴,rank越靠前就是arr_delay越大的 # 這里就是得到每組前10多延遲的 flights_sml %>% group_by(year, month, day) %>% filter(rank(desc(arr_delay)) < 10) %>% head(n = 3) # A tibble: 3 x 7 # Groups: year, month, day [1] year month day dep_delay arr_delay distance air_time <int> <int> <int> <dbl> <dbl> <dbl> <dbl> 1 2013 1 1 853 851 184 41 2 2013 1 1 290 338 1134 213 3 2013 1 1 260 263 266 46 # Find all groups bigger than a threshold: popular_dests <- flights %>% group_by(dest) %>% filter(n() > 365) # Standardise to compute per group metrics: # 這里等于是計(jì)算了百分比 # 始終記孜淦酢:向量化,向量化 popular_dests %>% filter(arr_delay > 0) %>% mutate(prop_delay = arr_delay / sum(arr_delay)) %>% select(year:day, dest, arr_delay, prop_delay) #> # A tibble: 131,106 x 6 #> # Groups: dest [77] #> year month day dest arr_delay prop_delay #> <int> <int> <int> <chr> <dbl> <dbl> #> 1 2013 1 1 IAH 11 0.000111 #> 2 2013 1 1 IAH 20 0.000201 #> 3 2013 1 1 MIA 33 0.000235 #> 4 2013 1 1 ORD 12 0.0000424 #> 5 2013 1 1 FLL 19 0.0000938 #> 6 2013 1 1 ORD 8 0.0000283 #> # … with 1.311e+05 more rows # 這樣可能直觀一點(diǎn) > popular_dests %>% + filter(arr_delay > 0) %>% + mutate(prop_delay = arr_delay / sum(arr_delay)) %>% + select(year:day, dest, arr_delay, prop_delay) %>% + arrange(dest) # A tibble: 131,106 x 6 # Groups: dest [77] year month day dest arr_delay prop_delay <int> <int> <int> <chr> <dbl> <dbl> 1 2013 1 1 ALB 40 0.00418 2 2013 1 1 ALB 44 0.00459 3 2013 1 2 ALB 71 0.00741 4 2013 1 2 ALB 82 0.00856 5 2013 1 3 ALB 40 0.00418 6 2013 1 4 ALB 30 0.00313 7 2013 1 6 ALB 95 0.00992 8 2013 1 6 ALB 4 0.000418 9 2013 1 7 ALB 41 0.00428 10 2013 1 10 ALB 120 0.0125 # ... with 131,096 more rows