5.6 Grouped summaries with summarise()
The last key verb is summarise(). It collapses a data frame to a single row:
summarise(flights, delay = mean(dep_delay, na.rm = TRUE))
#> # A tibble: 1 x 1
#> delay
#> <dbl>
#> 1 12.6
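(What does na.rm = TRUE mean? Section 5.6.2 comes back to it.)
summarise() is not terribly useful unless it is paired with group_by(). This changes the unit of analysis from the complete dataset to individual groups. Then, when you use the dplyr verbs on a grouped data frame, they are automatically applied "by group". For example, if we group by date, we get the average delay per date: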
by_day <- group_by(flights, year, month, day)
summarise(by_day, delay = mean(dep_delay, na.rm = TRUE))
#> `summarise()` regrouping output by 'year', 'month' (override with `.groups` argument)
#> # A tibble: 365 x 4
#> # Groups: year, month [12]
#> year month day delay
#> <int> <int> <int> <dbl>
#> 1 2013 1 1 11.5
#> 2 2013 1 2 13.9
#> 3 2013 1 3 11.0
#> 4 2013 1 4 8.95
#> 5 2013 1 5 5.73
#> 6 2013 1 6 7.15
#> # … with 359 more rows
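Together, group_by() and summarise() provide one of the most commonly used tools: grouped summaries. But before we go any further with this, we need to introduce the pipe.
5.6.1 Combining multiple operations with the pipe
Imagine that we want to explore the relationship between the distance and average delay for each destination. Using what you know about dplyr, you might write code like this: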
by_dest <- group_by(flights, dest)
delay <- summarise(by_dest,
count = n(),
dist = mean(distance, na.rm = TRUE),
delay = mean(arr_delay, na.rm = TRUE)
)
#> `summarise()` ungrouping output (override with `.groups` argument)
delay <- filter(delay, count > 20, dest != "HNL")
# It looks like delays increase with distance up to ~750 miles
# and then decrease. Maybe as flights get longer there's more
# ability to make up delays in the air?
ggplot(data = delay, mapping = aes(x = dist, y = delay)) +
geom_point(aes(size = count), alpha = 1/3) +
geom_smooth(se = FALSE)
#> `geom_smooth()` using method = 'loess' and formula 'y ~ x'
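There are three steps to prepare this data:
Group flights by destination.
Summarise to compute distance, average delay, and number of flights.
Filter to remove noisy points and Honolulu airport, which is almost twice as far away as the next closest airport.
This code is a little frustrating to write because we have to give each intermediate data frame a name. Every one of those variables has to be named, and that slows down our analysis.
There is another way to tackle the same problem, with the pipe, %>%: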
delays <- flights %>%
group_by(dest) %>%
summarise(
count = n(),
dist = mean(distance, na.rm = TRUE),
delay = mean(arr_delay, na.rm = TRUE)
) %>%
filter(count > 20, dest != "HNL")
#> `summarise()` ungrouping output (override with `.groups` argument)
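Behind the scenes, x %>% f(y) turns into f(x, y), and x %>% f(y) %>% g(z) turns into g(f(x, y), z), and so on. You can use the pipe to rewrite multiple operations in a way that you can read left-to-right, top-to-bottom. We'll use piping frequently from now on because it considerably improves the readability of code, and we'll come back to it in more detail in the chapter on pipes.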
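As a minimal sketch of that rewriting rule, the snippet below computes the same value once with nested calls and once with the pipe. The vector x is made up for illustration, and the pipe is assumed to come from loading dplyr.

```r
library(dplyr)  # makes the %>% pipe available

x <- c(1, 4, 9)  # hypothetical input

# Nested form: has to be read from the inside out
nested <- round(sum(sqrt(x)), 1)

# Piped form: reads left to right, one step at a time
piped <- x %>% sqrt() %>% sum() %>% round(1)

identical(nested, piped)
```

Both forms perform exactly the same function calls; the pipe only changes how the code is written down.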
5.6.2 Missing values
You may have wondered about the na.rm argument we used above. What happens if we don't set it?
flights %>%
group_by(year, month, day) %>%
summarise(mean = mean(dep_delay))
#> `summarise()` regrouping output by 'year', 'month' (override with `.groups` argument)
#> # A tibble: 365 x 4
#> # Groups: year, month [12]
#> year month day mean
#> <int> <int> <int> <dbl>
#> 1 2013 1 1 NA
#> 2 2013 1 2 NA
#> 3 2013 1 3 NA
#> 4 2013 1 4 NA
#> 5 2013 1 5 NA
#> 6 2013 1 6 NA
#> # … with 359 more rows
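We get a lot of missing values! That's because aggregation functions obey the usual rule of missing values: if there is any missing value in the input, the output will be a missing value. Fortunately, all aggregation functions have an na.rm argument, which removes the missing values prior to computation: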
flights %>%
group_by(year, month, day) %>%
summarise(mean = mean(dep_delay, na.rm = TRUE))
#> `summarise()` regrouping output by 'year', 'month' (override with `.groups` argument)
#> # A tibble: 365 x 4
#> # Groups: year, month [12]
#> year month day mean
#> <int> <int> <int> <dbl>
#> 1 2013 1 1 11.5
#> 2 2013 1 2 13.9
#> 3 2013 1 3 11.0
#> 4 2013 1 4 8.95
#> 5 2013 1 5 5.73
#> 6 2013 1 6 7.15
#> # … with 359 more rows
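The missing-value rule can be seen on a single small vector, without any grouping. The values below are invented for illustration and use only base R.

```r
delays <- c(5, NA, 10)   # hypothetical delays; NA marks an unknown value

mean(delays)                 # NA: one unknown input makes the result unknown
mean(delays, na.rm = TRUE)   # 7.5: the NA is dropped before averaging
sum(!is.na(delays))          # 2: the number of non-missing values
```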
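In this case, where missing values represent cancelled flights, we could also tackle the problem by first removing the cancelled flights. We'll save this dataset so we can reuse it in the next few examples.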
not_cancelled <- flights %>%
filter(!is.na(dep_delay), !is.na(arr_delay))
not_cancelled %>%
group_by(year, month, day) %>%
summarise(mean = mean(dep_delay))
#> `summarise()` regrouping output by 'year', 'month' (override with `.groups` argument)
#> # A tibble: 365 x 4
#> # Groups: year, month [12]
#> year month day mean
#> <int> <int> <int> <dbl>
#> 1 2013 1 1 11.4
#> 2 2013 1 2 13.7
#> 3 2013 1 3 10.9
#> 4 2013 1 4 8.97
#> 5 2013 1 5 5.73
#> 6 2013 1 6 7.15
#> # … with 359 more rows
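5.6.3 Counts
Whenever you do any aggregation, it's always a good idea to include either a count (n()) or a count of non-missing values (sum(!is.na(x))). That way you can check that you're not drawing conclusions based on very small amounts of data. For example, let's look at the planes (identified by their tail number) that have the highest average delays: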
delays <- not_cancelled %>%
group_by(tailnum) %>%
summarise(
delay = mean(arr_delay)
)
#> `summarise()` ungrouping output (override with `.groups` argument)
ggplot(data = delays, mapping = aes(x = delay)) +
geom_freqpoly(binwidth = 10)
As the plot shows, some planes have an average delay of 5 hours (300 minutes)!
We can get more insight if we draw a scatterplot of number of flights vs. average delay:
delays <- not_cancelled %>%
group_by(tailnum) %>%
summarise(
delay = mean(arr_delay, na.rm = TRUE),
n = n()
)
#> `summarise()` ungrouping output (override with `.groups` argument)
ggplot(data = delays, mapping = aes(x = n, y = delay)) +
geom_point(alpha = 1/10)
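Not surprisingly, there is much greater variation in the average delay when there are few flights. The shape of this plot is very characteristic: whenever you plot a mean (or other summary) vs. group size, you'll see that the variation decreases as the sample size increases.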
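A small simulation can make the shrinking variation concrete. This is only a sketch under stated assumptions: the data are standard normal draws rather than flight delays, and the helper sd_of_means() and the seed are made up for illustration.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

# Standard deviation of many simulated group means for a given group size
sd_of_means <- function(n, reps = 1000) {
  sd(replicate(reps, mean(rnorm(n))))
}

small_groups <- sd_of_means(5)    # groups of 5 observations
large_groups <- sd_of_means(100)  # groups of 100 observations

# The means of the larger groups are spread over a much narrower range
c(small = small_groups, large = large_groups)
```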
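When looking at this sort of plot, it's often useful to filter out the groups with the smallest numbers of observations, so you can see more of the pattern and less of the extreme variation in the smallest groups. This is what the following code does, as well as showing you a handy pattern for integrating ggplot2 into dplyr flows. Note that you have to switch from %>% to + once the pipeline reaches ggplot().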
delays %>%
filter(n > 25) %>%
ggplot(mapping = aes(x = n, y = delay)) +
geom_point(alpha = 1/10)
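The same pattern appears in baseball data from the Lahman package. When I plot the skill of the batter (measured by the batting average, ba) against the number of opportunities to hit the ball (measured by at bats, ab), you see two patterns:
As above, the variation in our aggregate decreases as we get more data points.
There is a positive correlation between skill (ba) and opportunities to hit the ball (ab). This is because teams control who gets to play, and obviously they'll pick their best players.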
# Convert to a tibble so it prints nicely
batting <- as_tibble(Lahman::Batting)
batters <- batting %>%
group_by(playerID) %>%
summarise(
ba = sum(H, na.rm = TRUE) / sum(AB, na.rm = TRUE),
ab = sum(AB, na.rm = TRUE)
)
#> `summarise()` ungrouping output (override with `.groups` argument)
batters %>%
filter(ab > 100) %>%
ggplot(mapping = aes(x = ab, y = ba)) +
geom_point() +
geom_smooth(se = FALSE)
#> `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
This also has important implications for ranking. If you naively sort on desc(ba), the people with the best batting averages are clearly lucky, not skilled:
batters %>%
arrange(desc(ba))
#> # A tibble: 19,689 x 3
#> playerID ba ab
#> <chr> <dbl> <int>
#> 1 abramge01 1 1
#> 2 alanirj01 1 1
#> 3 alberan01 1 1
#> 4 banisje01 1 1
#> 5 bartocl01 1 1
#> 6 bassdo01 1 1
#> # … with 19,683 more rows
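One possible remedy, sketched here on invented data, is to drop players with very few at bats before ranking. The tibble toy_batters and the threshold of 100 are assumptions for illustration, not part of the analysis above.

```r
library(dplyr)

# Hypothetical players: two with a single at bat, two with many
toy_batters <- tibble(
  playerID = c("a", "b", "c", "d"),
  ba       = c(1, 0, 0.31, 0.27),
  ab       = c(1L, 1L, 5000L, 3000L)
)

ranked <- toy_batters %>%
  filter(ab > 100) %>%   # threshold chosen only for illustration
  arrange(desc(ba))

ranked$playerID   # "c" "d": the one-at-bat players no longer top the ranking
```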
5.6.4 Useful summary functions
Just using means, counts, and sums can get you a long way, but R provides many other useful summary functions:
- Measures of location: we've used mean(x), but median(x) is also useful. The mean is the sum divided by the length; the median is a value where 50% of x is above it, and 50% is below it. It's sometimes useful to combine aggregation with logical subsetting.
not_cancelled %>%
  group_by(year, month, day) %>%
  summarise(
    avg_delay1 = mean(arr_delay),
    avg_delay2 = mean(arr_delay[arr_delay > 0]) # the average positive delay
  )
#> `summarise()` regrouping output by 'year', 'month' (override with `.groups` argument)
#> # A tibble: 365 x 5
#> # Groups: year, month [12]
#> year month day avg_delay1 avg_delay2
#> <int> <int> <int> <dbl> <dbl>
#> 1 2013 1 1 12.7 32.5
#> 2 2013 1 2 12.7 32.0
#> 3 2013 1 3 5.73 27.7
#> 4 2013 1 4 -1.93 28.3
#> 5 2013 1 5 -1.53 22.6
#> 6 2013 1 6 4.24 24.4
#> # … with 359 more rows
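The effect of the logical subset is easiest to see on a tiny vector. The delays below are invented for illustration.

```r
arr_delay <- c(-10, 20, -5, 40)   # hypothetical arrival delays in minutes

mean(arr_delay)                   # 11.25: early arrivals pull the average down
mean(arr_delay[arr_delay > 0])    # 30: average over the delayed flights only
```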
- Measures of spread: sd(x), IQR(x), mad(x). The root mean squared deviation, or standard deviation sd(x), is the standard measure of spread. The interquartile range IQR(x) and median absolute deviation mad(x) are robust alternatives that may be more useful if you have outliers.
# Why is distance to some destinations more variable than to others?
not_cancelled %>%
  group_by(dest) %>%
  summarise(distance_sd = sd(distance)) %>%
  arrange(desc(distance_sd))
#> `summarise()` ungrouping output (override with `.groups` argument)
#> # A tibble: 104 x 2
#> dest distance_sd
#> <chr> <dbl>
#> 1 EGE 10.5
#> 2 SAN 10.4
#> 3 SFO 10.2
#> 4 HNL 10.0
#> 5 SEA 9.98
#> 6 LAS 9.91
#> # … with 98 more rows
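To illustrate why the robust measures help with outliers, the sketch below compares the three functions on two made-up vectors that differ in a single extreme value.

```r
clean   <- c(1, 2, 3, 4, 5)
outlier <- c(1, 2, 3, 4, 100)  # same data with one extreme value

# Compare how each measure reacts to the outlier
rbind(
  clean   = c(sd = sd(clean),   IQR = IQR(clean),   mad = mad(clean)),
  outlier = c(sd = sd(outlier), IQR = IQR(outlier), mad = mad(outlier))
)
```

In this example sd() grows from about 1.6 to over 40, while IQR() and mad() are unchanged.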
- Measures of rank: min(x), quantile(x, 0.25), max(x). Quantiles are a generalisation of the median. For example, quantile(x, 0.25) will find a value of x that is greater than 25% of the values, and less than the remaining 75%.
# When do the first and last flights leave each day?
not_cancelled %>%
  group_by(year, month, day) %>%
  summarise(
    first = min(dep_time),
    last = max(dep_time)
  )
#> `summarise()` regrouping output by 'year', 'month' (override with `.groups` argument)
#> # A tibble: 365 x 5
#> # Groups: year, month [12]
#> year month day first last
#> <int> <int> <int> <int> <int>
#> 1 2013 1 1 517 2356
#> 2 2013 1 2 42 2354
#> 3 2013 1 3 32 2349
#> 4 2013 1 4 25 2358
#> 5 2013 1 5 14 2357
#> 6 2013 1 6 16 2355
#> # … with 359 more rows
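A quick check of the relationship between quantiles and the median, using base R's default quantile algorithm on a made-up vector:

```r
x <- 1:9   # hypothetical values

quantile(x, 0.25)   # 3: a quarter of the way through the sorted values
quantile(x, 0.50)   # 5: the same value as median(x)
median(x)           # 5
```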
- Measures of position: first(x), nth(x, 2), last(x). These work similarly to x[1], x[2], and x[length(x)], but let you set a default value if that position does not exist (i.e. you're trying to get the 3rd element from a group that only has two elements). For example, we can find the first and last departure for each day:
not_cancelled %>%
  group_by(year, month, day) %>%
  summarise(
    first_dep = first(dep_time),
    last_dep = last(dep_time)
  )
#> `summarise()` regrouping output by 'year', 'month' (override with `.groups` argument)
#> # A tibble: 365 x 5
#> # Groups: year, month [12]
#> year month day first_dep last_dep
#> <int> <int> <int> <int> <int>
#> 1 2013 1 1 517 2356
#> 2 2013 1 2 42 2354
#> 3 2013 1 3 32 2349
#> 4 2013 1 4 25 2358
#> 5 2013 1 5 14 2357
#> 6 2013 1 6 16 2355
#> # … with 359 more rows
These functions are complementary to filtering on ranks. Filtering gives you all variables, with each observation in a separate row:
not_cancelled %>%
  group_by(year, month, day) %>%
  mutate(r = min_rank(desc(dep_time))) %>%
  filter(r %in% range(r))
#> # A tibble: 770 x 20
#> # Groups: year, month, day [365]
#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#> <int> <int> <int> <int> <int> <dbl> <int> <int>
#> 1 2013 1 1 517 515 2 830 819
#> 2 2013 1 1 2356 2359 -3 425 437
#> 3 2013 1 2 42 2359 43 518 442
#> 4 2013 1 2 2354 2359 -5 413 437
#> 5 2013 1 3 32 2359 33 504 442
#> 6 2013 1 3 2349 2359 -10 434 445
#> # … with 764 more rows, and 12 more variables: arr_delay <dbl>, carrier <chr>,
#> # flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
#> # distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>, r <int>
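The default-value behaviour of the position helpers can be sketched on a vector that is too short. The vector and the default of 0 are made up for illustration.

```r
library(dplyr)

x <- c(10, 20)   # a hypothetical group with only two elements

first(x)                 # 10
last(x)                  # 20
nth(x, 3)                # NA: there is no third element
nth(x, 3, default = 0)   # 0: the supplied default is returned instead
```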
- Counts: you've seen n(), which takes no arguments, and returns the size of the current group. To count the number of non-missing values, use sum(!is.na(x)). To count the number of distinct (unique) values, use n_distinct(x).
# Which destinations have the most carriers?
not_cancelled %>%
  group_by(dest) %>%
  summarise(carriers = n_distinct(carrier)) %>%
  arrange(desc(carriers))
#> `summarise()` ungrouping output (override with `.groups` argument)
#> # A tibble: 104 x 2
#> dest carriers
#> <chr> <int>
#> 1 ATL 7
#> 2 BOS 7
#> 3 CLT 7
#> 4 ORD 7
#> 5 TPA 7
#> 6 AUS 6
#> # … with 98 more rows
Counts are so useful that dplyr provides a simple helper if all you want is a count, count():
not_cancelled %>%
  count(dest)
#> # A tibble: 104 x 2
#> dest n
#> <chr> <int>
#> 1 ABQ 254
#> 2 ACK 264
#> 3 ALB 418
#> 4 ANC 8
#> 5 ATL 16837
#> 6 AUS 2411
#> # … with 98 more rows
You can optionally provide a weight variable. For example, you could use this to "count" (sum) the total number of miles a plane flew:
not_cancelled %>%
  count(tailnum, wt = distance)
#> # A tibble: 4,037 x 2
#> tailnum n
#> <chr> <dbl>
#> 1 D942DN 3418
#> 2 N0EGMQ 239143
#> 3 N10156 109664
#> 4 N102UW 25722
#> 5 N103US 24619
#> 6 N104UW 24616
#> # … with 4,031 more rows
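As a sketch of what the weighted count computes, the snippet below compares count() with wt against an explicit grouped sum. The tibble toy and its values are invented for illustration.

```r
library(dplyr)

# Hypothetical flights: two planes, with distances in miles
toy <- tibble(
  tailnum  = c("N1", "N1", "N2"),
  distance = c(100, 250, 400)
)

by_count <- toy %>% count(tailnum, wt = distance)

by_hand <- toy %>%
  group_by(tailnum) %>%
  summarise(n = sum(distance))

by_count$n   # 350 400
by_hand$n    # 350 400
```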
- Counts and proportions of logical values: sum(x > 10), mean(y == 0). When used with numeric functions, TRUE is converted to 1 and FALSE to 0. This makes sum() and mean() very useful: sum(x) gives the number of TRUEs in x, and mean(x) gives the proportion.
# How many flights left before 5am? (these usually indicate delayed
# flights from the previous day)
not_cancelled %>%
  group_by(year, month, day) %>%
  summarise(n_early = sum(dep_time < 500))
#> `summarise()` regrouping output by 'year', 'month' (override with `.groups` argument)
#> # A tibble: 365 x 4
#> # Groups: year, month [12]
#> year month day n_early
#> <int> <int> <int> <int>
#> 1 2013 1 1 0
#> 2 2013 1 2 3
#> 3 2013 1 3 4
#> 4 2013 1 4 3
#> 5 2013 1 5 3
#> 6 2013 1 6 2
#> # … with 359 more rows
# What proportion of flights are delayed by more than an hour?
not_cancelled %>%
  group_by(year, month, day) %>%
  summarise(hour_prop = mean(arr_delay > 60))
#> `summarise()` regrouping output by 'year', 'month' (override with `.groups` argument)
#> # A tibble: 365 x 4
#> # Groups: year, month [12]
#> year month day hour_prop
#> <int> <int> <int> <dbl>
#> 1 2013 1 1 0.0722
#> 2 2013 1 2 0.0851
#> 3 2013 1 3 0.0567
#> 4 2013 1 4 0.0396
#> 5 2013 1 5 0.0349
#> 6 2013 1 6 0.0470
#> # … with 359 more rows
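5.6.5 Grouping by multiple variables
When you group by multiple variables, each summary peels off one level of the grouping. That makes it easy to progressively roll up a dataset: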
daily <- group_by(flights, year, month, day)
(per_day <- summarise(daily, flights = n()))
#> `summarise()` regrouping output by 'year', 'month' (override with `.groups` argument)
#> # A tibble: 365 x 4
#> # Groups: year, month [12]
#> year month day flights
#> <int> <int> <int> <int>
#> 1 2013 1 1 842
#> 2 2013 1 2 943
#> 3 2013 1 3 914
#> 4 2013 1 4 915
#> 5 2013 1 5 720
#> 6 2013 1 6 832
#> # … with 359 more rows
(per_month <- summarise(per_day, flights = sum(flights)))
#> `summarise()` regrouping output by 'year' (override with `.groups` argument)
#> # A tibble: 12 x 3
#> # Groups: year [1]
#> year month flights
#> <int> <int> <int>
#> 1 2013 1 27004
#> 2 2013 2 24951
#> 3 2013 3 28834
#> 4 2013 4 28330
#> 5 2013 5 28796
#> 6 2013 6 28243
#> # … with 6 more rows
(per_year <- summarise(per_month, flights = sum(flights)))
#> `summarise()` ungrouping output (override with `.groups` argument)
#> # A tibble: 1 x 2
#> year flights
#> <int> <int>
#> 1 2013 336776
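Be careful when progressively rolling up summaries: it's OK for sums and counts, but you need to think about weighting means and variances, and it's not possible to do it exactly for rank-based statistics like the median. In other words, the sum of groupwise sums is the overall sum, but the median of groupwise medians is not the overall median.
5.6.6 Ungrouping
If you need to remove grouping, and return to operations on ungrouped data, use ungroup().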
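The difference between statistics that roll up and those that do not can be checked with two small made-up groups, using base R only:

```r
a <- c(1, 2, 9)         # hypothetical group 1
b <- c(3, 4, 5, 6, 7)   # hypothetical group 2

# Sums roll up exactly
sum(a) + sum(b) == sum(c(a, b))   # TRUE (both are 37)

# Means roll up only when weighted by group size
group_means <- c(mean(a), mean(b))
group_sizes <- c(length(a), length(b))
mean(group_means)                        # 4.5, not the overall mean
weighted.mean(group_means, group_sizes)  # 4.625, equal to mean(c(a, b))

# Medians do not roll up
median(c(median(a), median(b)))   # 3.5
median(c(a, b))                   # 4.5
```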
daily %>%
ungroup() %>% # no longer grouped by date
summarise(flights = n()) # all flights
#> # A tibble: 1 x 1
#> flights
#> <int>
#> 1 336776
5.7 Grouped mutates (and filters)
Grouping is most useful in conjunction with summarise(), but you can also do convenient operations with mutate() and filter():
- Find the worst members of each group:
flights_sml %>%
  group_by(year, month, day) %>%
  filter(rank(desc(arr_delay)) < 10)
#> # A tibble: 3,306 x 7
#> # Groups: year, month, day [365]
#> year month day dep_delay arr_delay distance air_time
#> <int> <int> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 2013 1 1 853 851 184 41
#> 2 2013 1 1 290 338 1134 213
#> 3 2013 1 1 260 263 266 46
#> 4 2013 1 1 157 174 213 60
#> 5 2013 1 1 216 222 708 121
#> 6 2013 1 1 255 250 589 115
#> # … with 3,300 more rows
- Find all groups bigger than a threshold:
popular_dests <- flights %>%
  group_by(dest) %>%
  filter(n() > 365)
popular_dests
#> # A tibble: 332,577 x 19
#> # Groups: dest [77]
#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#> <int> <int> <int> <int> <int> <dbl> <int> <int>
#> 1 2013 1 1 517 515 2 830 819
#> 2 2013 1 1 533 529 4 850 830
#> 3 2013 1 1 542 540 2 923 850
#> 4 2013 1 1 544 545 -1 1004 1022
#> 5 2013 1 1 554 600 -6 812 837
#> 6 2013 1 1 554 558 -4 740 728
#> # … with 332,571 more rows, and 11 more variables: arr_delay <dbl>,
#> # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#> # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
- Standardise to compute per group metrics:
popular_dests %>%
  filter(arr_delay > 0) %>%
  mutate(prop_delay = arr_delay / sum(arr_delay)) %>%
  select(year:day, dest, arr_delay, prop_delay)
#> # A tibble: 131,106 x 6
#> # Groups: dest [77]
#> year month day dest arr_delay prop_delay
#> <int> <int> <int> <chr> <dbl> <dbl>
#> 1 2013 1 1 IAH 11 0.000111
#> 2 2013 1 1 IAH 20 0.000201
#> 3 2013 1 1 MIA 33 0.000235
#> 4 2013 1 1 ORD 12 0.0000424
#> 5 2013 1 1 FLL 19 0.0000938
#> 6 2013 1 1 ORD 8 0.0000283
#> # … with 131,100 more rows
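The key point in a grouped mutate is that summary functions such as sum() are evaluated within each group. A minimal sketch on invented data (the tibble toy is hypothetical):

```r
library(dplyr)

# Hypothetical delays for two destinations
toy <- tibble(
  dest      = c("A", "A", "B", "B", "B"),
  arr_delay = c(10, 30, 5, 5, 10)
)

shares <- toy %>%
  group_by(dest) %>%
  mutate(prop_delay = arr_delay / sum(arr_delay)) %>%  # sum() is computed per dest
  ungroup()

shares$prop_delay   # 0.25 0.75 0.25 0.25 0.50
```

Within each destination the proportions add up to 1, which would not be the case without group_by().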