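About these notes
Data cleaning is probably the most time-consuming part of data analysis. dplyr is an efficient R package for data cleaning and manipulation, and one of the core packages of the tidyverse.
Commonly used dplyr operations include: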
mutate()
adds new variables that are functions of existing variables.
select()
picks variables based on their names.
filter()
picks cases based on their values.
summarise()
reduces multiple values down to a single summary.
arrange()
changes the ordering of the rows.
group_by()
allows you to perform any operation “by group”.
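As a rough sketch of how these verbs combine in a single pipeline, the following uses a small made-up tibble. The `toy` data, its column names, and its values are assumptions for illustration only and are not part of the Gasoline data used later.

```r
library(dplyr)

# Hypothetical toy data, made up for illustration only
toy <- tibble(
  country = c("A", "A", "A", "B", "B"),
  year    = c(1968, 1969, 1970, 1969, 1970),
  value   = c(100, 1, 3, 10, 20)
)

result <- toy %>%
  filter(year >= 1969) %>%                          # keep rows that satisfy a condition
  select(country, value) %>%                        # keep only the needed columns
  mutate(double_value = value * 2) %>%              # add a column derived from an existing one
  group_by(country) %>%                             # later steps are computed per country
  summarise(mean_double = mean(double_value)) %>%   # collapse each group to one row
  arrange(desc(mean_double))                        # sort rows, largest mean first

result
```

Each verb takes a data frame as its first argument and returns a data frame, which is what makes chaining them like this possible.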
Main reference: https://b-rodrigues.github.io/modern_R/descriptive-statistics-and-data-manipulation.html#the-tidyverses-enfant-prodige-dplyr
Recommended reading: https://dplyr.tidyverse.org/
Preparation
Load the dplyr package:
library(dplyr)
For the data, we use the Gasoline dataset from the plm package as the example. The dataset contains gasoline consumption for 18 countries from 1960 to 1978. The raw data is a data.frame object, which we convert to a tibble with as_tibble().
You can think of a tibble as an enhanced data.frame. The functions in dplyr work on data.frame objects as well as on tibble objects.
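# Data preparation
install.packages("plm")          # one-time install; plm provides the Gasoline dataset
data(Gasoline, package = "plm")  # load the dataset without attaching plm
gasoline <- as_tibble(Gasoline)  # convert the data.frame to a tibble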
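To illustrate the point that dplyr verbs accept both kinds of object, here is a minimal sketch on a made-up two-row data set. The names `df` and `tb` and their values are hypothetical; the sketch only shows that the class of the input is preserved in the output.

```r
library(dplyr)

# Hypothetical data, made up for illustration only
df <- data.frame(year = c(1968, 1969), x = c(1, 2))  # plain data.frame
tb <- as_tibble(df)                                  # the same data as a tibble

res_df <- filter(df, year == 1969)  # filter() applied to a data.frame
res_tb <- filter(tb, year == 1969)  # filter() applied to a tibble

class(res_df)  # the result is still a plain data.frame
class(res_tb)  # the result is still a tibble
```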
Filtering observations with filter()
filter() keeps the observations (the rows of the dataset) that satisfy a given condition. For example, to keep the observations in gasoline from the year 1969:
filter(gasoline, year == 1969)
## # A tibble: 18 x 6
## country year lgaspcar lincomep lrpmg lcarpcap
## <fct> <int> <dbl> <dbl> <dbl> <dbl>
## 1 AUSTRIA 1969 4.05 -6.15 -0.559 -8.79
## 2 BELGIUM 1969 3.85 -5.86 -0.355 -8.52
## 3 CANADA 1969 4.86 -5.56 -1.04 -8.10
## 4 DENMARK 1969 4.17 -5.72 -0.407 -8.47
## 5 FRANCE 1969 3.77 -5.84 -0.315 -8.37
## 6 GERMANY 1969 3.90 -5.83 -0.589 -8.44
## 7 GREECE 1969 4.89 -6.59 -0.180 -10.7
## 8 IRELAND 1969 4.21 -6.38 -0.272 -8.95
## 9 ITALY 1969 3.74 -6.28 -0.248 -8.67
## 10 JAPAN 1969 4.52 -6.16 -0.417 -9.61
## 11 NETHERLA 1969 3.99 -5.88 -0.417 -8.63
## 12 NORWAY 1969 4.09 -5.74 -0.338 -8.69
## 13 SPAIN 1969 3.99 -5.60 0.669 -9.72
## 14 SWEDEN 1969 3.99 -7.77 -2.73 -8.20
## 15 SWITZERL 1969 4.21 -5.91 -0.918 -8.47
## 16 TURKEY 1969 5.72 -7.39 -0.298 -12.5
## 17 U.K. 1969 3.95 -6.03 -0.383 -8.47
## 18 U.S.A. 1969 4.84 -5.41 -1.22 -7.79
The same line rewritten with the pipe operator:
gasoline %>% filter(year == 1969)
The result is the same. The pipe operator %>% passes the object on its left as the first argument of the function on its right, so x %>% f(y) is equivalent to f(x, y).
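The equivalence can be checked directly. The sketch below uses a made-up tibble (`toy` and its values are assumptions for illustration) and compares the piped call with the ordinary nested call.

```r
library(dplyr)

# Hypothetical toy data, made up for illustration only
toy <- tibble(year = c(1968, 1969, 1970), x = c(1, 2, 3))

piped  <- toy %>% filter(year == 1969)  # x %>% f(y)
nested <- filter(toy, year == 1969)     # f(x, y)

# Both forms return the same rows
same_result <- isTRUE(all.equal(piped, nested))
same_result

# Pipes can be chained: each step receives the previous result as its first argument
chained <- toy %>% filter(year >= 1969) %>% select(x)
chained
```

The chained form reads left to right in the order the steps are applied, which is the main reason the pipe is used throughout these notes.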
Suppose we want the observations from 1969 through 1973. We can use either the %in% operator or between().
%in% tests whether each element of the vector on its left appears in the vector on its right.
between(x, left, right) is a dplyr function equivalent to x >= left & x <= right.
gasoline %>% filter(year %in% seq(1969, 1973))
gasoline %>% filter(between(year, 1969, 1973))
Both lines give the same result:
## # A tibble: 90 x 6
## country year lgaspcar lincomep lrpmg lcarpcap
## <fct> <int> <dbl> <dbl> <dbl> <dbl>
## 1 AUSTRIA 1969 4.05 -6.15 -0.559 -8.79
## 2 AUSTRIA 1970 4.08 -6.08 -0.597 -8.73
## 3 AUSTRIA 1971 4.11 -6.04 -0.654 -8.64
## 4 AUSTRIA 1972 4.13 -5.98 -0.596 -8.54
## 5 AUSTRIA 1973 4.20 -5.90 -0.594 -8.49
## 6 BELGIUM 1969 3.85 -5.86 -0.355 -8.52
## 7 BELGIUM 1970 3.87 -5.80 -0.378 -8.45
## 8 BELGIUM 1971 3.87 -5.76 -0.399 -8.41
## 9 BELGIUM 1972 3.91 -5.71 -0.311 -8.36
## 10 BELGIUM 1973 3.90 -5.64 -0.373 -8.31
## # ... with 80 more rows
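To see what the two conditions compute before they reach filter(), the sketch below applies them to a plain vector. The `years` vector is made up for illustration. It also shows one case where the two approaches differ: %in% with seq() only matches the listed whole numbers, while between() accepts any value in the interval.

```r
library(dplyr)

# Hypothetical vector of years, made up for illustration only
years <- c(1968, 1969, 1971, 1973, 1974)

in_range_1 <- years %in% seq(1969, 1973)      # membership in 1969, 1970, ..., 1973
in_range_2 <- between(years, 1969, 1973)      # inclusive on both ends
in_range_3 <- years >= 1969 & years <= 1973   # what between() is equivalent to

# For whole-number years all three agree
all_agree <- identical(in_range_1, in_range_2) && identical(in_range_2, in_range_3)
all_agree

# For a non-integer value they differ
frac_in      <- 1969.5 %in% seq(1969, 1973)   # FALSE: 1969.5 is not in the sequence
frac_between <- between(1969.5, 1969, 1973)   # TRUE: 1969.5 lies inside the interval
```

Since year is stored as an integer in the gasoline data, either form is fine there.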
Selecting variables with select()
select() can be used to extract the specified variables:
gasoline %>% select(country, year, lrpmg)
## # A tibble: 342 x 3
## country year lrpmg
## <fct> <int> <dbl>
## 1 AUSTRIA 1960 -0.335
## 2 AUSTRIA 1961 -0.351
## 3 AUSTRIA 1962 -0.380
## 4 AUSTRIA 1963 -0.414
## 5 AUSTRIA 1964 -0.445
## 6 AUSTRIA 1965 -0.497
## 7 AUSTRIA 1966 -0.467
## 8 AUSTRIA 1967 -0.506
## 9 AUSTRIA 1968 -0.522
## 10 AUSTRIA 1969 -0.559
## # ... with 332 more rows
select() can also be used to drop the specified variables:
gasoline %>% select(-country, -year, -lrpmg)
## # A tibble: 342 x 3
## lgaspcar lincomep lcarpcap
## <dbl> <dbl> <dbl>
## 1 4.17 -6.47 -9.77
## 2 4.10 -6.43 -9.61
## 3 4.07 -6.41 -9.46
## 4 4.06 -6.37 -9.34
## 5 4.04 -6.32 -9.24
## 6 4.03 -6.29 -9.12
## 7 4.05 -6.25 -9.02
## 8 4.05 -6.23 -8.93
## 9 4.05 -6.21 -8.85
## 10 4.05 -6.15 -8.79
## # ... with 332 more rows
When extracting variables, you can rename them at the same time with the form new_name = old_name:
gasoline %>% select(country, date = year, lrpmg)
## # A tibble: 342 x 3
## country date lrpmg
## <fct> <int> <dbl>
## 1 AUSTRIA 1960 -0.335
## 2 AUSTRIA 1961 -0.351
## 3 AUSTRIA 1962 -0.380
## 4 AUSTRIA 1963 -0.414
## 5 AUSTRIA 1964 -0.445
## 6 AUSTRIA 1965 -0.497
## 7 AUSTRIA 1966 -0.467
## 8 AUSTRIA 1967 -0.506
## 9 AUSTRIA 1968 -0.522
## 10 AUSTRIA 1969 -0.559
## # ... with 332 more rows
If you only want to rename variables and keep all the others, use rename():
gasoline %>% rename(nation = country, date = year)
## # A tibble: 342 x 6
## nation date lgaspcar lincomep lrpmg lcarpcap
## <fct> <int> <dbl> <dbl> <dbl> <dbl>
## 1 AUSTRIA 1960 4.17 -6.47 -0.335 -9.77
## 2 AUSTRIA 1961 4.10 -6.43 -0.351 -9.61
## 3 AUSTRIA 1962 4.07 -6.41 -0.380 -9.46
## 4 AUSTRIA 1963 4.06 -6.37 -0.414 -9.34
## 5 AUSTRIA 1964 4.04 -6.32 -0.445 -9.24
## 6 AUSTRIA 1965 4.03 -6.29 -0.497 -9.12
## 7 AUSTRIA 1966 4.05 -6.25 -0.467 -9.02
## 8 AUSTRIA 1967 4.05 -6.23 -0.506 -8.93
## 9 AUSTRIA 1968 4.05 -6.21 -0.522 -8.85
## 10 AUSTRIA 1969 4.05 -6.15 -0.559 -8.79
## # ... with 332 more rows
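The difference between renaming inside select() and using rename() is easiest to see from the resulting column names. The sketch below uses a made-up tibble whose column names mimic three of the gasoline columns; the values are assumptions for illustration.

```r
library(dplyr)

# Hypothetical toy data, made up for illustration only
toy <- tibble(
  country = c("A", "B"),
  year    = c(1969, 1970),
  lrpmg   = c(-0.5, -0.3)
)

# select() with new_name = old_name keeps only the listed columns
names_select <- names(toy %>% select(country, date = year))

# rename() changes the name and keeps every other column
names_rename <- names(toy %>% rename(date = year))

names_select
names_rename
```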
select() can also be used to reorder variables:
gasoline %>% select(year, country, lrpmg, everything())
## # A tibble: 342 x 6
## year country lrpmg lgaspcar lincomep lcarpcap
## <int> <fct> <dbl> <dbl> <dbl> <dbl>
## 1 1960 AUSTRIA -0.335 4.17 -6.47 -9.77
## 2 1961 AUSTRIA -0.351 4.10 -6.43 -9.61
## 3 1962 AUSTRIA -0.380 4.07 -6.41 -9.46
## 4 1963 AUSTRIA -0.414 4.06 -6.37 -9.34
## 5 1964 AUSTRIA -0.445 4.04 -6.32 -9.24
## 6 1965 AUSTRIA -0.497 4.03 -6.29 -9.12
## 7 1966 AUSTRIA -0.467 4.05 -6.25 -9.02
## 8 1967 AUSTRIA -0.506 4.05 -6.23 -8.93
## 9 1968 AUSTRIA -0.522 4.05 -6.21 -8.85
## 10 1969 AUSTRIA -0.559 4.05 -6.15 -8.79
## # ... with 332 more rows
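everything() in the code above matches all variables; because year, country and lrpmg are already listed, it effectively appends the remaining variables after them. It is one of the "select helpers".
Select helpers are a group of special functions intended for use inside select() and other selecting functions; they make it easy to pick variables by name. The select helpers include: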
starts_with(): Starts with a prefix.
ends_with(): Ends with a suffix.
contains(): Contains a literal string.
matches(): Matches a regular expression.
num_range(): Matches a numerical range like x01, x02, x03.
one_of(): Matches variable names in a character vector (superseded by any_of() and all_of() in newer versions).
everything(): Matches all variables.
last_col(): Select last variable, possibly with an offset.
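A few of these helpers can be tried on a one-row tibble that has the same column names as gasoline. The row values and the `wanted` vector are made up for illustration; `all_of()` is used for the character-vector case on the assumption that a dplyr version of 1.0.0 or later is installed.

```r
library(dplyr)

# Hypothetical one-row tibble with the same column names as gasoline
toy <- tibble(
  country  = "AUSTRIA",
  year     = 1960L,
  lgaspcar = 4.17,
  lincomep = -6.47,
  lrpmg    = -0.335,
  lcarpcap = -9.77
)

by_prefix <- names(toy %>% select(starts_with("l")))          # names beginning with "l"
by_suffix <- names(toy %>% select(ends_with("p")))            # names ending with "p"
by_substr <- names(toy %>% select(contains("car")))           # names containing "car"
by_regex  <- names(toy %>% select(matches("cap$")))           # names matching a regular expression
last_one  <- names(toy %>% select(last_col()))                # the last column
second_to_last <- names(toy %>% select(last_col(offset = 1))) # one column before the last

# Selecting from a character vector of names (hypothetical vector)
wanted  <- c("country", "year")
by_name <- names(toy %>% select(all_of(wanted)))
```

Helpers can be combined with explicit names in the same call, for example select(country, starts_with("l")).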