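While tinkering with Shiny recently, I came across a very handy data-reading package, vroom. These are my notes for future reference.
1. Automatic delimiter detection
vroom guesses the delimiter automatically, so CSV and TSV files are both read with the same call, vroom("xxx.csv").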
library(vroom)
data <- vroom("flights.tsv")
#> Observations: 336,776
#> Variables: 19
#> chr [ 4]: carrier, tailnum, origin, dest
#> dbl [14]: year, month, day, dep_time, sched_dep_time, dep_delay, arr_time, sched_arr...
#> dttm [ 1]: time_hour
#>
#> Call `spec()` for a copy-pastable column specification
#> Specify the column types with `col_types` to quiet this message
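This prints a long message describing the type of each column. If you don't need it, you can silence it by passing the column specification back in through col_types.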
s <- spec(data)
data <- vroom("flights.tsv", col_types = s)
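As a small self-contained check of the delimiter guessing described above, the sketch below writes the same toy table (made-up data, not the flights file) as both CSV and TSV to temporary files and reads each with an identical call. It assumes a vroom version that supports the show_col_types argument, which is used here as another way to silence the message.

```r
library(vroom)

# Hypothetical toy data, written to temporary files for illustration only
df <- data.frame(id = 1:3, name = c("a", "b", "c"))
csv_path <- tempfile(fileext = ".csv")
tsv_path <- tempfile(fileext = ".tsv")
write.table(df, csv_path, sep = ",", row.names = FALSE, quote = FALSE)
write.table(df, tsv_path, sep = "\t", row.names = FALSE, quote = FALSE)

# The same call for both files; no delimiter is specified, so it is guessed
from_csv <- vroom(csv_path, show_col_types = FALSE)
from_tsv <- vroom(tsv_path, show_col_types = FALSE)

# Both reads should yield the same column names
identical(names(from_csv), names(from_tsv))
```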
2. Reading multiple files at once
Batch reading is one of vroom's highlights: pass a vector of file paths and the files are combined into a single table.
files <- fs::dir_ls(glob = "flights_*tsv")
files
#> flights_9E.tsv flights_AA.tsv flights_AS.tsv flights_B6.tsv flights_DL.tsv
#> flights_EV.tsv flights_F9.tsv flights_FL.tsv flights_HA.tsv flights_MQ.tsv
#> flights_OO.tsv flights_UA.tsv flights_US.tsv flights_VX.tsv flights_WN.tsv
#> flights_YV.tsv
data <- vroom(files)
#> Observations: 336,776
#> Variables: 19
#> chr [ 4]: carrier, tailnum, origin, dest
#> dbl [14]: year, month, day, dep_time, sched_dep_time, dep_delay, arr_time, sched_arr...
#> dttm [ 1]: time_hour
#>
#> Call `spec()` for a copy-pastable column specification
#> Specify the column types with `col_types` to quiet this message
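When several files are combined, it is often useful to know which file each row came from. The sketch below, on made-up toy files in a temporary directory, uses vroom's id argument for that; the column name source_file is my own choice, and base R's list.files() stands in for fs::dir_ls() to keep dependencies minimal.

```r
library(vroom)

# Hypothetical toy files in a temporary directory
tmp_dir <- tempfile("flights_demo_")
dir.create(tmp_dir)
vroom_write(data.frame(carrier = "AA", flight = c(1, 2)),
            file.path(tmp_dir, "flights_AA.tsv"))
vroom_write(data.frame(carrier = "UA", flight = c(3, 4, 5)),
            file.path(tmp_dir, "flights_UA.tsv"))

# Collect the matching paths, then read them all in one call
files <- list.files(tmp_dir, pattern = "^flights_.*\\.tsv$", full.names = TRUE)
combined <- vroom(files, id = "source_file", show_col_types = FALSE)

# Rows per originating file
table(basename(combined$source_file))
```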
3. Reading and writing compressed files
- vroom_write() can write compressed files directly
vroom_write(flights, "flights.tsv.gz")
# Check file sizes to show file is compressed
fs::file_size(c("flights.tsv", "flights.tsv.gz"))
#> 29.62M 7.87M
# Read the file back in
data <- vroom("flights.tsv.gz")
#> Observations: 336,776
#> Variables: 19
#> chr [ 4]: carrier, tailnum, origin, dest
#> dbl [14]: year, month, day, dep_time, sched_dep_time, dep_delay, arr_time, sched_arr...
#> dttm [ 1]: time_hour
#>
#> Call `spec()` for a copy-pastable column specification
#> Specify the column types with `col_types` to quiet this message
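A minimal round-trip sketch on made-up data follows, in case the flights data frame is not at hand. It writes to a path ending in .gz, checks that the file really is gzip-compressed by looking at its first two bytes, and reads it back. The toy data frame and file names are assumptions for illustration.

```r
library(vroom)

# Hypothetical toy data
toy <- data.frame(id = 1:1000, value = rep(c("x", "y"), 500))
gz_path <- tempfile(fileext = ".tsv.gz")

# The .gz extension is what triggers compression on write
vroom_write(toy, gz_path)

# gzip files start with the magic bytes 1f 8b
magic <- readBin(gz_path, what = "raw", n = 2)
magic

# Reading works the same way as for an uncompressed file
round_trip <- vroom(gz_path, show_col_types = FALSE)
dim(round_trip)
```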
4. Reading files from a URL
file <- "https://raw.githubusercontent.com/r-lib/vroom/master/inst/extdata/mtcars.csv"
data <- vroom(file)
#> Observations: 32
#> Variables: 12
#> chr [ 1]: model
#> dbl [11]: mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb
#>
#> Call `spec()` for a copy-pastable column specification
#> Specify the column types with `col_types` to quiet this message
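5. Reading from and writing to pipe connections
This one feels a bit magical: for simple filtering like this, it can replace a separate Perl preprocessing step.
- Extract the United Airlines rows (lines containing the word UA). grep also filters out the header line, so the column names are supplied through col_names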
# Return only flights on United Airlines
data <- vroom(pipe("grep -w UA flights.tsv"), col_names = names(flights))
#> Observations: 58,665
#> Variables: 19
#> chr [ 4]: carrier, tailnum, origin, dest
#> dbl [14]: year, month, day, dep_time, sched_dep_time, dep_delay, arr_time, sched_arr...
#> dttm [ 1]: time_hour
#>
#> Call `spec()` for a copy-pastable column specification
#> Specify the column types with `col_types` to quiet this message
- Or specify the compression tool, for example pigz, when writing a compressed file
bench::workout({
vroom_write(flights, "flights.tsv.gz")
vroom_write(flights, pipe("pigz > flights.tsv.gz"))
})
#> # A tibble: 2 x 3
#> exprs process real
#> <bch:expr> <bch:tm> <bch:tm>
#> 1 vroom_write(flights, "flights.tsv.gz") 3.5s 2.69s
#> 2 vroom_write(flights, pipe("pigz > flights.tsv.gz")) 1.54s 975.09ms
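The pipe-based filtering above can be tried on a toy file as well. This sketch assumes a Unix-like system with grep on the PATH; the data and column names are made up. Because grep removes the header line along with the non-matching rows, the names are passed explicitly.

```r
library(vroom)

# Hypothetical toy file
path <- tempfile(fileext = ".tsv")
vroom_write(
  data.frame(carrier = c("UA", "AA", "UA"), flight = c(1, 2, 3)),
  path
)

# Let grep do the row filtering before vroom parses anything
cmd <- paste("grep -w UA", shQuote(path))
ua <- vroom(pipe(cmd),
            col_names = c("carrier", "flight"),
            show_col_types = FALSE)
ua
```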
6. Selecting columns
- Keep only the specified columns
data <- vroom("flights.tsv", col_select = c(year, flight, tailnum))
#> Observations: 336,776
#> Variables: 3
#> chr [1]: tailnum
#> dbl [2]: year, flight
#>
#> Call `spec()` for a copy-pastable column specification
#> Specify the column types with `col_types` to quiet this message
- Drop the specified columns
data <- vroom("flights.tsv", col_select = c(-dep_time, -air_time:-time_hour))
#> Observations: 336,776
#> Variables: 13
#> chr [4]: carrier, tailnum, origin, dest
#> dbl [9]: year, month, day, sched_dep_time, dep_delay, arr_time, sched_arr_time, arr...
#>
#> Call `spec()` for a copy-pastable column specification
#> Specify the column types with `col_types` to quiet this message
- Rename a column
data <- vroom("flights.tsv", col_select = list(plane = tailnum, everything()))
#> Observations: 336,776
#> Variables: 19
#> chr [ 4]: carrier, tailnum, origin, dest
#> dbl [14]: year, month, day, dep_time, sched_dep_time, dep_delay, arr_time, sched_arr...
#> dttm [ 1]: time_hour
#>
#> Call `spec()` for a copy-pastable column specification
#> Specify the column types with `col_types` to quiet this message
data
#> # A tibble: 336,776 x 19
#> plane year month day dep_time sched_dep_time dep_delay arr_time
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 N142… 2013 1 1 517 515 2 830
#> 2 N242… 2013 1 1 533 529 4 850
#> 3 N619… 2013 1 1 542 540 2 923
#> 4 N804… 2013 1 1 544 545 -1 1004
#> 5 N668… 2013 1 1 554 600 -6 812
#> 6 N394… 2013 1 1 554 558 -4 740
#> 7 N516… 2013 1 1 555 600 -5 913
#> 8 N829… 2013 1 1 557 600 -3 709
#> 9 N593… 2013 1 1 557 600 -3 838
#> 10 N3AL… 2013 1 1 558 600 -2 753
#> # … with 336,766 more rows, and 11 more variables: sched_arr_time <dbl>,
#> # arr_delay <dbl>, carrier <chr>, flight <dbl>, origin <chr>,
#> # dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
#> # time_hour <dttm>
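The three col_select patterns above (keep, drop, rename) can be compared side by side on a small made-up table. The toy data and the variable names kept, dropped, and renamed are illustrative assumptions.

```r
library(vroom)

# Hypothetical toy data with a few flight-like columns
toy <- data.frame(
  year = c(2013, 2013),
  flight = c(1545, 1714),
  tailnum = c("N14228", "N24211"),
  dest = c("IAH", "IAH")
)
path <- tempfile(fileext = ".tsv")
vroom_write(toy, path)

# Keep only the named columns
kept <- vroom(path, col_select = c(year, flight), show_col_types = FALSE)

# Drop one column, keep the rest
dropped <- vroom(path, col_select = -dest, show_col_types = FALSE)

# Rename while selecting: new_name = old_name
renamed <- vroom(path, col_select = c(plane = tailnum, year),
                 show_col_types = FALSE)

names(kept)
names(dropped)
names(renamed)
```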
7. Setting column types
In most cases vroom guesses the column types correctly, but it occasionally gets one wrong, in which case you can specify the type manually. You can also fix the type afterwards with dplyr, though that is a bit more work.
Type reference; the character in quotes is the abbreviation actually used in col_types.
- col_logical() 'l', containing only T, F, TRUE, FALSE, 1 or 0.
- col_integer() 'i', integer values.
- col_double() 'd', floating point values.
- col_number() 'n', numbers containing the grouping_mark.
- col_date(format = "") 'D', with the locale's date_format.
- col_time(format = "") 't', with the locale's time_format.
- col_datetime(format = "") 'T', ISO8601 date times.
- col_factor(levels, ordered) 'f', a fixed set of values.
- col_character() 'c', everything else.
- col_skip() '_' or '-', don't import this column.
- col_guess() '?', parse using the "best" type based on the input.
Usage examples:
# read the 'year' column as an integer
data <- vroom("flights.tsv", col_types = c(year = "i"))
# also skip reading the 'time_hour' column
data <- vroom("flights.tsv", col_types = c(year = "i", time_hour = "_"))
# also read the carrier as a factor
data <- vroom("flights.tsv", col_types = c(year = "i", time_hour = "_", carrier = "f"))
data <- vroom("flights.tsv",
col_types = list(year = col_integer(), time_hour = col_skip(), carrier = col_factor())
)
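To confirm that the abbreviation form and the col_*() form give the same result, the sketch below reads a made-up toy file both ways and compares the resulting column classes. The data and the column called note are assumptions for illustration.

```r
library(vroom)

# Hypothetical toy file
toy <- data.frame(
  year = c(2013, 2013, 2014),
  carrier = c("UA", "AA", "UA"),
  note = c("a", "b", "c")
)
path <- tempfile(fileext = ".tsv")
vroom_write(toy, path)

# Abbreviation form: integer, factor, and a skipped column
a <- vroom(path,
           col_types = c(year = "i", carrier = "f", note = "_"),
           show_col_types = FALSE)

# Same specification written with the col_*() functions
b <- vroom(path,
           col_types = list(year = col_integer(),
                            carrier = col_factor(),
                            note = col_skip()),
           show_col_types = FALSE)

# Column classes of each result, for comparison
sapply(a, class)
sapply(b, class)
```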
8. Read speed
In a word: fast! It is well suited to machine-learning datasets, which routinely run to several gigabytes.
The figure below compares the time each package takes to read and write 1.55 GB of data.