[R] The dplyr package (Part 1): Data transformation

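In day-to-day work I run into a lot of batch data-processing problems, and reinventing the wheel each time is slow. That is how I came across dplyr, an R package that takes care of most everyday data-manipulation problems. dplyr is one of the core packages of the tidyverse and is used for data manipulation. It is built around the following five core functions.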

filter() picks observations by their values

arrange() reorders the rows

select() picks variables by name

mutate() creates new variables as functions of existing variables

summarize() collapses many values down to a single summary statistic

All of these functions can be combined with group_by(), which changes the scope of each function from the whole dataset to each group separately. The five functions all work the same way:

1. The first argument is a data frame.

2. The subsequent arguments describe what to do with the data frame, using variable names (without quotes).

3. The result is a new data frame.
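The three points above can be checked on a tiny example. This is only a sketch: the `scores` tibble and its columns are made up for illustration and are not part of the flights data used later.

```r
library(dplyr)

# A made-up data frame, used only for this illustration
scores <- tibble(name = c("a", "b", "c"), score = c(90, 55, 72))

# 1. The first argument is the data frame.
# 2. `score` is written without quotes and refers to the column.
passed <- filter(scores, score >= 60)

# 3. The result is a new data frame; `scores` itself is unchanged.
nrow(passed)  # 2
nrow(scores)  # 3
```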

Installation

# The easiest way to get dplyr is to install the whole tidyverse:
install.packages("tidyverse")

# Alternatively, install just dplyr:
install.packages("dplyr")

Demo

The examples below use a dataset of flight information to show how dplyr works.

# Install and load the dataset, then load dplyr itself
install.packages('nycflights13')
library(nycflights13)
library(dplyr)

1.1 Filtering rows with filter()

filter() selects a subset of observations based on their values. The first argument is the name of the data frame; the second and subsequent arguments are the expressions used to filter it.

# Select the flights that departed on January 1
> a <- filter(flights,month==1,day==1)
> head(a)
# A tibble: 6 x 19
   year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier
  <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl> <chr>  
1  2013     1     1      517            515         2      830            819        11 UA     
2  2013     1     1      533            529         4      850            830        20 UA     
3  2013     1     1      542            540         2      923            850        33 AA     
4  2013     1     1      544            545        -1     1004           1022       -18 B6     
5  2013     1     1      554            600        -6      812            837       -25 DL     
6  2013     1     1      554            558        -4      740            728        12 UA     
# ... with 9 more variables: flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

1.1.1 Comparison operators

R provides a standard set of comparison operators: >, >=, <, <=, != (not equal), and == (equal).

1.1.2 Logical operators

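Multiple arguments to filter() are combined with "and": every expression must be true for a row to be included in the output. For other kinds of combinations you need the Boolean operators: & is "and" (the intersection), | is "or", and ! is "not".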

# Find the flights that departed in November or December
> b <- filter(flights,month==11 | month ==12)
> head(b)
# A tibble: 6 x 19
   year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier
  <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl> <chr>  
1  2013    11     1        5           2359         6      352            345         7 B6     
2  2013    11     1       35           2250       105      123           2356        87 B6     
3  2013    11     1      455            500        -5      641            651       -10 US     
4  2013    11     1      539            545        -6      856            827        29 UA     
5  2013    11     1      542            545        -3      831            855       -24 AA     
6  2013    11     1      549            600       -11      912            923       -11 UA     
# ... with 9 more variables: flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

The code above can be written more compactly as:

c <- filter(flights,month %in% c(11,12))

Here, x %in% y selects every row where x is one of the values in y.
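To see that the two forms agree, here is a small sketch on a made-up `month` column (the tibble `df` is an assumption for illustration only). It also shows ! negating a condition.

```r
library(dplyr)

# Made-up data, used only for this illustration
df <- tibble(month = c(1, 11, 12, 6, 11))

with_or <- filter(df, month == 11 | month == 12)
with_in <- filter(df, month %in% c(11, 12))

# Both forms keep the same rows
identical(with_or$month, with_in$month)  # TRUE

# `!` flips the condition: keep every month that is not 11 or 12
others <- filter(df, !(month %in% c(11, 12)))
others$month  # 1 6
```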

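filter() keeps only the rows where the condition is TRUE; it drops rows where the condition is FALSE or NA. If you want to keep missing values, ask for them explicitly: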

> df <- tibble(x=c(1,NA,3)) # tibble() builds a data frame
> head(df)
# A tibble: 3 x 1
      x
  <dbl>
1     1
2    NA
3     3
> filter(df,x>1)
# A tibble: 1 x 1
      x
  <dbl>
1     3
> filter(df,is.na(x) | x > 1)
# A tibble: 2 x 1
      x
  <dbl>
1    NA
2     3
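The reason behind this behavior is that a comparison with NA is itself NA. A minimal base R sketch, reusing the same three values as above:

```r
x <- c(1, NA, 3)

# Comparing NA with anything gives NA, which filter() treats as "drop"
gt_one <- x > 1
gt_one  # FALSE NA TRUE

# Adding is.na(x) with `|` turns the NA position into TRUE
keep <- is.na(x) | x > 1
keep    # FALSE TRUE TRUE
```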

1.2 Arranging rows with `arrange()`

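arrange() works much like filter(), except that instead of selecting rows it changes their order. It takes a data frame and a set of column names to order by. If more than one column name is given, each additional column is used to break ties in the values of the preceding columns.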

# Sort flights by year, month, and day
> d <- arrange(flights,year,month,day)
> head(d)
# A tibble: 6 x 19
   year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier
  <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl> <chr>  
1  2013     1     1      517            515         2      830            819        11 UA     
2  2013     1     1      533            529         4      850            830        20 UA     
3  2013     1     1      542            540         2      923            850        33 AA     
4  2013     1     1      544            545        -1     1004           1022       -18 B6     
5  2013     1     1      554            600        -6      812            837       -25 DL     
6  2013     1     1      554            558        -4      740            728        12 UA     
# ... with 9 more variables: flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
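The tie-breaking rule is easier to see on a few rows. The sketch below uses a made-up tibble (`df`, with columns `g` and `v`, all assumptions for illustration): rows are ordered by `g` first, and `v` only decides the order inside each value of `g`.

```r
library(dplyr)

# Made-up data, used only for this illustration
df <- tibble(g = c("b", "a", "b", "a"), v = c(1, 4, 3, 2))

sorted <- arrange(df, g, v)
sorted$g  # "a" "a" "b" "b"
sorted$v  # 2 4 1 3
```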

Use desc() to sort by a column in descending order.

> e <- arrange(flights,desc(arr_delay))
> head(e)
# A tibble: 6 x 19
   year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier
  <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl> <chr>  
1  2013     1     9      641            900      1301     1242           1530      1272 HA     
2  2013     6    15     1432           1935      1137     1607           2120      1127 MQ     
3  2013     1    10     1121           1635      1126     1239           1810      1109 MQ     
4  2013     9    20     1139           1845      1014     1457           2210      1007 AA     
5  2013     7    22      845           1600      1005     1044           1815       989 MQ     
6  2013     4    10     1100           1900       960     1342           2211       931 DL     
# ... with 9 more variables: flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

Missing values are always sorted at the end.

> df <- tibble(x=c(5,NA,2))
> arrange(df,x)
# A tibble: 3 x 1
      x
  <dbl>
1     2
2     5
3    NA
# Put the missing values first
> arrange(df,desc(is.na(x)))
# A tibble: 3 x 1
      x
  <dbl>
1    NA
2     5
3     2

1.3 Selecting columns with select()

select() lets you quickly narrow a dataset down to a useful subset of variables, using operations based on their names.

# Select columns by name
> f <- select(flights,year,month,day)
> head(f)
# A tibble: 6 x 3
   year month   day
  <int> <int> <int>
1  2013     1     1
2  2013     1     1
3  2013     1     1
4  2013     1     1
5  2013     1     1
6  2013     1     1
# Select all columns between year and day (inclusive)
> g <- select(flights,year:day)
> head(g)
# A tibble: 6 x 3
   year month   day
  <int> <int> <int>
1  2013     1     1
2  2013     1     1
3  2013     1     1
4  2013     1     1
5  2013     1     1
6  2013     1     1
# Select all columns except those from year to day (year and day are excluded too)
> h <- select(flights,-(year:day))
> head(h)
# A tibble: 6 x 16
  dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum
     <int>          <int>     <dbl>    <int>          <int>     <dbl> <chr>    <int> <chr>  
1      517            515         2      830            819        11 UA        1545 N14228 
2      533            529         4      850            830        20 UA        1714 N24211 
3      542            540         2      923            850        33 AA        1141 N619AA 
4      544            545        -1     1004           1022       -18 B6         725 N804JB 
5      554            600        -6      812            837       -25 DL         461 N668DN 
6      554            558        -4      740            728        12 UA        1696 N39463 
# ... with 7 more variables: origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

select() also accepts a number of helper functions.

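starts_with("abc") matches names that begin with abc

ends_with("xyz") matches names that end with xyz

contains("ijk") matches names that contain ijk

matches("(.)\\1") selects the variables that match a regular expression; this one matches any name that contains a repeated character.

num_range("x",1:3) matches x1, x2, and x3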
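Here is a short sketch of each helper in action. The tibble `df` and its column names are made up so that every helper has exactly one obvious match; they are not part of the flights data.

```r
library(dplyr)

# Made-up one-row tibble; only the column names matter here
df <- tibble(x1 = 1, x2 = 2, x3 = 3, abc_id = 4, total_xyz = 5, seen = 6)

names(select(df, starts_with("abc")))    # "abc_id"
names(select(df, ends_with("xyz")))      # "total_xyz"
names(select(df, contains("ot")))        # "total_xyz"
names(select(df, matches("(.)\\1")))     # "seen" (the repeated "e")
names(select(df, num_range("x", 1:3)))   # "x1" "x2" "x3"
```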

Combine select() with the everything() helper to move a few variables to the start of the data frame.

> i <- select(flights,time_hour,air_time,everything())
> head(i)
# A tibble: 6 x 19
  time_hour           air_time  year month   day dep_time sched_dep_time dep_delay arr_time
  <dttm>                 <dbl> <int> <int> <int>    <int>          <int>     <dbl>    <int>
1 2013-01-01 05:00:00      227  2013     1     1      517            515         2      830
2 2013-01-01 05:00:00      227  2013     1     1      533            529         4      850
3 2013-01-01 05:00:00      160  2013     1     1      542            540         2      923
4 2013-01-01 05:00:00      183  2013     1     1      544            545        -1     1004
5 2013-01-01 06:00:00      116  2013     1     1      554            600        -6      812
6 2013-01-01 05:00:00      150  2013     1     1      554            558        -4      740
# ... with 10 more variables: sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
#   flight <int>, tailnum <chr>, origin <chr>, dest <chr>, distance <dbl>, hour <dbl>,
#   minute <dbl>

1.4 Adding new variables with `mutate()`

mutate() always adds new columns at the end of the dataset. The easiest way to see all the columns is the View() function.

# First create a narrower dataset
> flights_sml <- select(flights,year:day,ends_with("delay"),distance,air_time)
> head(flights_sml)
# A tibble: 6 x 7
   year month   day dep_delay arr_delay distance air_time
  <int> <int> <int>     <dbl>     <dbl>    <dbl>    <dbl>
1  2013     1     1         2        11     1400      227
2  2013     1     1         4        20     1416      227
3  2013     1     1         2        33     1089      160
4  2013     1     1        -1       -18     1576      183
5  2013     1     1        -6       -25      762      116
6  2013     1     1        -4        12      719      150
# Add new columns
> j <- mutate(flights_sml,gain=arr_delay - dep_delay,speed= distance/air_time * 60)
> head(j)
# A tibble: 6 x 9
   year month   day dep_delay arr_delay distance air_time  gain speed
  <int> <int> <int>     <dbl>     <dbl>    <dbl>    <dbl> <dbl> <dbl>
1  2013     1     1         2        11     1400      227     9  370.
2  2013     1     1         4        20     1416      227    16  374.
3  2013     1     1         2        33     1089      160    31  408.
4  2013     1     1        -1       -18     1576      183   -17  517.
5  2013     1     1        -6       -25      762      116   -19  394.
6  2013     1     1        -4        12      719      150    16  288.

A newly created column can be used right away in the same call. If you only want to keep the new variables, use transmute().

> k <- transmute(flights,gain=arr_delay-dep_delay,hours=air_time/60,gain_per_hour=gain/hours)
> head(k)
# A tibble: 6 x 3
   gain hours gain_per_hour
  <dbl> <dbl>         <dbl>
1     9  3.78          2.38
2    16  3.78          4.23
3    31  2.67         11.6 
4   -17  3.05         -5.57
5   -19  1.93         -9.83
6    16  2.5           6.4

1.5 Grouped summaries with summarize()

summarize() collapses a data frame to a single row.

> l <- summarise(flights,delay=mean(dep_delay,na.rm = T))
> head(l)
# A tibble: 1 x 1
  delay
  <dbl>
1  12.6

summarize() is not much use unless it is paired with group_by(), which changes the unit of analysis from the whole dataset to individual groups.

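# After grouping, rows that share the same year, month, and day form one group, and the mean delay is computed per group
> by_day <- group_by(flights,year,month,day) 
> m <- summarise(by_day,delay=mean(dep_delay,na.rm = T)) # na.rm = T drops missing values before computing; otherwise many of the results would be NA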
`summarise()` regrouping output by 'year', 'month' (override with `.groups` argument)
> head(m)
# A tibble: 6 x 4
# Groups:   year, month [1]
   year month   day delay
  <int> <int> <int> <dbl>
1  2013     1     1 11.5 
2  2013     1     2 13.9 
3  2013     1     3 11.0 
4  2013     1     4  8.95
5  2013     1     5  5.73
6  2013     1     6  7.15
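The "regrouping output" message in the transcript above appears because summarise() removes only the last level of grouping. A minimal sketch of that behavior on made-up data, assuming dplyr 1.0.0 or later (where the `.groups` argument named in the message is available):

```r
library(dplyr)

# Made-up data, used only for this illustration
df <- tibble(year = c(2013, 2013, 2013), month = c(1, 1, 2), delay = c(10, 20, 30))
by_month <- group_by(df, year, month)

# Default-style behavior: the last grouping variable (month) is peeled off
peeled <- summarise(by_month, delay = mean(delay), .groups = "drop_last")
group_vars(peeled)  # "year"

# Ask for an ungrouped result instead
flat <- summarise(by_month, delay = mean(delay), .groups = "drop")
group_vars(flat)    # character(0)
```

Setting `.groups` explicitly also silences the message.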

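The pipe %>% makes code easier to read because it puts the focus on the transformations rather than on what is being transformed. When reading code, pronounce %>% as "then". The following example looks at the relationship between distance and average delay for each destination.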

> delay <- flights %>%
  group_by(dest) %>%
  summarise(
    count=n(), # number of flights in each dest group
    dist=mean(distance,na.rm = T),
    delay=mean(arr_delay,na.rm = T)
  ) %>%
  filter(count > 20,dest != "HNL")
> head(delay)
# A tibble: 6 x 4
  dest  count  dist delay
  <chr> <int> <dbl> <dbl>
1 ABQ     254 1826   4.38
2 ACK     265  199   4.85
3 ALB     439  143  14.4 
4 ATL   17215  757. 11.3 
5 AUS    2439 1514.  6.02
6 AVL     275  584.  8.00
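A pipeline like the one above is just another way of writing nested function calls: x %>% f(y) means f(x, y). The sketch below compares the two spellings on a made-up tibble (`df`, `g`, and `v` are assumptions for illustration).

```r
library(dplyr)

# Made-up data, used only for this illustration
df <- tibble(g = c("a", "a", "b"), v = c(1, 3, 10))

# Nested calls: read from the inside out
nested <- summarise(group_by(df, g), m = mean(v))

# Piped calls: read top to bottom as "group, then summarise"
piped <- df %>%
  group_by(g) %>%
  summarise(m = mean(v))

identical(nested$m, piped$m)  # TRUE
```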

Next, look for a relationship between the number of flights and the average delay.

> library(ggplot2)
> not_cancelled <- flights %>%
  filter(!is.na(dep_delay),!is.na(arr_delay))
> delays <- not_cancelled %>%
  group_by(tailnum) %>%
  summarise(
    delay=mean(arr_delay,na.rm = T),
    n=n()
  )
> ggplot(data = delays,mapping = aes(x=n,y=delay))+
  geom_point()

When looking at a plot like this, it is usually worth filtering out the groups with very few observations. That keeps the extreme variation in the smallest groups from obscuring the pattern in the data.

delays %>%
  filter(n>25) %>%
  ggplot(mapping = aes(x=n,y=delay)) +
  geom_point()

Another case: use the data in the Lahman package to compute the batting average (hits / at-bats) of every major league baseball player.

# Convert to a tibble so the output prints nicely
> install.packages("Lahman")
> library(Lahman)
> batting <- as_tibble(Lahman::Batting)
> batters <- batting %>%
  group_by(playerID) %>%
  summarise(
    ba=sum(H,na.rm = T)/sum(AB,na.rm = T),
    ab=sum(AB,na.rm = T)
  )
> batters %>%
  filter(ab>100) %>%
  ggplot(mapping = aes(x=ab,y=ba))+
  geom_point()+
  geom_smooth(se=F)

1.6 Useful summary functions

Means, counts, and sums alone are far from enough; R provides many other useful summary functions.

1.6.1 Measures of location

median(x) returns the median: 50% of x is above it and 50% is below it.
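A quick base R sketch of how the median differs from the mean when one value is extreme (the vector `x` is made up for illustration):

```r
# Made-up values with one large outlier
x <- c(1, 2, 3, 4, 100)

mean(x)    # 22, pulled up by the outlier
median(x)  # 3, the middle value
```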

Aggregation functions can be combined with logical subsetting.

> not_cancelled %>%
  group_by(year,month,day) %>%
  summarise(
    # average delay
    avg_delay1=mean(arr_delay),
    # average positive delay
    avg_delay2=mean(arr_delay[arr_delay>0])
  )
`summarise()` regrouping output by 'year', 'month' (override with `.groups` argument)
# A tibble: 365 x 5
# Groups:   year, month [12]
    year month   day avg_delay1 avg_delay2
   <int> <int> <int>      <dbl>      <dbl>
 1  2013     1     1     12.7         32.5
 2  2013     1     2     12.7         32.0
 3  2013     1     3      5.73        27.7
 4  2013     1     4     -1.93        28.3
 5  2013     1     5     -1.53        22.6
 6  2013     1     6      4.24        24.4
 7  2013     1     7     -4.95        27.8
 8  2013     1     8     -3.23        20.8
 9  2013     1     9     -0.264       25.6
10  2013     1    10     -5.90        27.3
# ... with 355 more rows

1.6.2 Measures of rank: min(x), quantile(x,0.25), and max(x)

quantile(x,0.25) finds the value of x that is greater than 25% of the values and less than the remaining 75%.
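A small base R sketch of the three functions on a made-up vector, using R's default quantile method:

```r
# Made-up values, used only for this illustration
x <- 1:9

min(x)                 # 1
q1 <- quantile(x, 0.25)
unname(q1)             # 3: two values lie below it and six above
max(x)                 # 9
```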

Example: find when the first and last flights leave each day.

> n <- not_cancelled %>%
  group_by(year,month,day) %>%
  summarise(
    first=min(dep_time),
    last=max(dep_time)
  )
> head(n)
# A tibble: 6 x 5
# Groups:   year, month [1]
   year month   day first  last
  <int> <int> <int> <int> <int>
1  2013     1     1   517  2356
2  2013     1     2    42  2354
3  2013     1     3    32  2349
4  2013     1     4    25  2358
5  2013     1     5    14  2357
6  2013     1     6    16  2355

1.6.3 Counts

n() has already been used above to return the size of the current group. To count the non-missing values, use sum(!is.na(x)). To count the distinct (unique) values, use n_distinct(x).

# Which destinations are served by the most carriers?
> o <- not_cancelled %>%
  group_by(dest) %>%
  summarise(carriers=n_distinct(carrier)) %>% # count distinct values only
  arrange(desc(carriers))
> head(o)
# A tibble: 6 x 2
  dest  carriers
  <chr>    <int>
1 ATL          7
2 BOS          7
3 CLT          7
4 ORD          7
5 TPA          7
6 AUS          6

# To understand n_distinct(), run the code below: it ignores duplicates and counts only the distinct values.
> x <- sample(1:10, 1e5, replace = TRUE)
> length(unique(x))
> n_distinct(x) # equivalent to the previous line
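The counting idioms behave differently when missing values are present. A minimal sketch on a made-up vector (named `v` here so it does not clash with the `x` above):

```r
library(dplyr)

# Made-up values with two NAs
v <- c(3, NA, 3, 7, NA, 7, 7)

length(v)                    # 7 values in total
sum(!is.na(v))               # 5 non-missing values
n_distinct(v)                # 3 distinct values: 3, 7, and NA
n_distinct(v, na.rm = TRUE)  # 2 distinct values once NA is ignored
```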

Counts are so common that dplyr provides a simple helper for the case where a count is all you need.

> not_cancelled %>%
   count(dest)
# Total distance flown by each plane; this is effectively a sum.
> not_cancelled %>%
    count(tailnum,wt=distance)
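count() can be read as shorthand for a group_by() followed by summarise(). The sketch below checks this on a made-up tibble (`df` and its values are assumptions for illustration); with `wt`, the count becomes a sum of the weights.

```r
library(dplyr)

# Made-up data, used only for this illustration
df <- tibble(dest = c("ATL", "BOS", "ATL"), distance = c(760, 190, 760))

# Plain count: number of rows per dest
counted <- count(df, dest)
by_hand <- df %>% group_by(dest) %>% summarise(n = n())
counted$n  # 2 1

# Weighted count: sum of distance per dest
weighted <- count(df, dest, wt = distance)
summed <- df %>% group_by(dest) %>% summarise(n = sum(distance))
weighted$n  # 1520 190
```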

1.6.4 Counts and proportions of logical values

When used with numeric functions, TRUE is converted to 1 and FALSE to 0. This makes sum() and mean() very useful with logical values: sum(x) gives the number of TRUEs in x, and mean(x) gives the proportion.
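As a quick check of that rule in base R, on a made-up vector:

```r
# Made-up values, used only for this illustration
x <- c(2, 8, 5, 11)

x > 4        # FALSE TRUE TRUE TRUE
sum(x > 4)   # 3 values are greater than 4
mean(x > 4)  # 0.75, the proportion greater than 4
```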

# How many flights left before 5 a.m.?
> not_cancelled %>%
+   group_by(year,month,day) %>%
+   summarise(n_nearly=sum(dep_time<500))
`summarise()` regrouping output by 'year', 'month' (override with `.groups` argument)
# A tibble: 365 x 4
# Groups:   year, month [12]
    year month   day n_nearly
   <int> <int> <int>    <int>
 1  2013     1     1        0
 2  2013     1     2        3
 3  2013     1     3        4
 4  2013     1     4        3
 5  2013     1     5        3
 6  2013     1     6        2
 7  2013     1     7        2
 8  2013     1     8        1
 9  2013     1     9        3
10  2013     1    10        3
# ... with 355 more rows

# What proportion of flights are delayed by more than an hour?
> not_cancelled %>%
+   group_by(year,month,day) %>%
+   summarise(hour_perc=mean(arr_delay>60))
`summarise()` regrouping output by 'year', 'month' (override with `.groups` argument)
# A tibble: 365 x 4
# Groups:   year, month [12]
    year month   day hour_perc
   <int> <int> <int>     <dbl>
 1  2013     1     1    0.0722
 2  2013     1     2    0.0851
 3  2013     1     3    0.0567
 4  2013     1     4    0.0396
 5  2013     1     5    0.0349
 6  2013     1     6    0.0470
 7  2013     1     7    0.0333
 8  2013     1     8    0.0213
 9  2013     1     9    0.0202
10  2013     1    10    0.0183
# ... with 355 more rows

1.7 Grouped mutates (and filters)

Grouping is most useful together with summarize(), but it can also be combined with mutate() and filter() for some very convenient operations.

Example 1: find the worst members of each group.

> flights_sml %>%
+   group_by(year,month,day) %>%
+   filter(rank(desc(arr_delay))<5)
# A tibble: 1,464 x 7
# Groups:   year, month, day [365]
    year month   day dep_delay arr_delay distance air_time
   <int> <int> <int>     <dbl>     <dbl>    <dbl>    <dbl>
 1  2013     1     1       853       851      184       41
 2  2013     1     1       290       338     1134      213
 3  2013     1     1       260       263      266       46
 4  2013     1     1       379       456     1092      222
 5  2013     1     2       268       288     1092      203
 6  2013     1     2       334       323      937      150
 7  2013     1     2       337       368     2586      346
 8  2013     1     2       379       359     1620      228
 9  2013     1     3       174       176     1008      152
10  2013     1     3       268       270     1069      158
# ... with 1,454 more rows

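A supplementary note on the rank() function used above.

rank() returns the rank of each element of a one-dimensional array or vector x; it does not reorder x. For numeric x, smaller values receive smaller ranks.

rank() treats known values and missing values separately. Missing values can be ranked before the known values (na.last = FALSE) or after them (na.last = TRUE), or kept as NA and left out of the ranking (na.last = "keep").

The ties.method argument decides how equal values are ranked:

"first" is the most basic ordering: smaller values come first, and tied elements are ranked in their order of appearance.

"max" gives every tied element the largest rank in its tie group, so the next distinct value follows without a gap.

"min" gives every tied element the smallest rank in its tie group, which leaves a gap before the next distinct value.

"average" gives every tied element the mean of the ranks in its tie group, which may be fractional. This is the default.

"random" ranks tied elements in a random order, so ties are not settled on a first-come, first-served basis.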

> rank(t <- c(6.8, 8.1, 7.2))
[1] 1 3 2
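The options described above can be compared side by side in base R. The vector `x` below is made up so that it contains one tie (the two 10s):

```r
# Made-up values with one tie
x <- c(10, 20, 10, 30)

rank(x)                          # 1.5 3.0 1.5 4.0 (default "average")
rank(x, ties.method = "min")     # 1 3 1 4
rank(x, ties.method = "max")     # 2 3 2 4
rank(x, ties.method = "first")   # 1 3 2 4

# Missing values stay NA with na.last = "keep"
rank(c(2, NA, 1), na.last = "keep")  # 2 NA 1
```

Because the grouped filter in Example 1 calls rank() with the default "average" method, tied delays share a fractional rank there.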

Example 2: find all groups bigger than a threshold:

> popular_dests <- flights %>%
+   group_by(dest) %>%
+   filter(n()>365)
> head(popular_dests)
# A tibble: 6 x 19
# Groups:   dest [5]
   year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier
  <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl> <chr>  
1  2013     1     1      517            515         2      830            819        11 UA     
2  2013     1     1      533            529         4      850            830        20 UA     
3  2013     1     1      542            540         2      923            850        33 AA     
4  2013     1     1      544            545        -1     1004           1022       -18 B6     
5  2013     1     1      554            600        -6      812            837       -25 DL     
6  2013     1     1      554            558        -4      740            728        12 UA     
# ... with 9 more variables: flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
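The examples in this section use grouped filters; a grouped mutate works the same way, with summary functions such as sum() evaluated inside each group. A minimal sketch on a made-up tibble (`df`, `g`, `v`, and `share` are all assumptions for illustration):

```r
library(dplyr)

# Made-up data, used only for this illustration
df <- tibble(g = c("a", "a", "b", "b", "b"), v = c(1, 3, 2, 2, 4))

shares <- df %>%
  group_by(g) %>%
  mutate(share = v / sum(v)) %>%  # sum(v) is computed within each group
  ungroup()

shares$share  # 0.25 0.75 0.25 0.25 0.50
```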

References:
1. https://dplyr.tidyverse.org/
2. R for Data Science