Background
I was recently working on a project that consolidates SAS files scattered across different folders.
With so many files, merging them was very slow, so I asked colleagues about the speedup techniques they commonly use.
Steps worth optimizing
1. Reading data
Selecting only the specific columns you need at read time, before any downstream operations, gives some speedup (the impact is modest).
read_sas(data_file, col_select = NULL)  # col_select = NULL (the default) reads every column
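For instance, a minimal sketch (the file path and column names here are hypothetical):

library(haven)

# read only the columns needed downstream; col_select accepts dplyr::select()-style expressions
dm <- read_sas(
  "rawdata/dm.sas7bdat",                 # hypothetical path
  col_select = c(Subject, Site, Folder)  # hypothetical columns
)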
2. Loop optimization (for loops)
With a plain for loop, pre-allocate an empty container up front and write into it inside the loop, instead of growing the result each iteration (see the sketch below).
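A minimal sketch of that pattern (the folder path is hypothetical): pre-allocate a list of the right length, fill it inside the loop, and bind everything once at the end.

library(haven)
library(dplyr)

files <- list.files("rawdata", pattern = "\\.sas7bdat$", full.names = TRUE)  # hypothetical folder

# pre-allocate the container instead of growing it inside the loop
results <- vector("list", length(files))
for (i in seq_along(files)) {
  results[[i]] <- read_sas(files[i])
}
all_data <- bind_rows(results)  # one bind at the end is far cheaper than rbind per iteration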
- foreach + doParallel multi-core parallel approach
library(foreach)
library(doParallel)

no_cores <- detectCores() - 1  # doParallel attaches parallel, so detectCores() is available
# registerDoParallel(no_cores) also works (implicit cluster, stopped with stopImplicitCluster())
cl <- makeCluster(no_cores)    # keep a handle so the cluster can be stopped later
registerDoParallel(cl)
- foreach() needs the %dopar% operator to actually run the body in parallel
base <- 3  # toy value used in all the outputs below
# .combine = c returns a vector
foreach(exponent = 1:5, .combine = c) %dopar% base^exponent
[1] 3 9 27 81 243
# .combine = rbind returns a matrix
foreach(exponent = 1:5, .combine = rbind) %dopar% base^exponent
         [,1]
result.1    3
result.2    9
result.3   27
result.4   81
result.5  243
# .combine = list returns a list (.multicombine = TRUE avoids nested pairwise list() calls)
foreach(exponent = 1:5, .combine = list, .multicombine=TRUE) %dopar% base^exponent
[[1]]
[1] 3
[[2]]
[1] 9
[[3]]
[1] 27
[[4]]
[1] 81
[[5]]
[1] 243
# .combine = data.frame returns a data frame
foreach(exponent = 1:5, .combine = data.frame) %dopar% base^exponent
  result.1 result.2 result.3 result.4 result.5
1        3        9       27       81      243
# shut down the cluster when done (stopImplicitCluster() is for the registerDoParallel(no_cores) form)
stopCluster(cl)
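Back to the actual task: a minimal sketch of merging the SAS files with this setup, reusing the hypothetical `files` vector from the for-loop sketch above. Each worker reads one file and the pieces are row-bound as they combine.

library(haven)
library(dplyr)

merged <- foreach(
  f = files,
  .combine  = bind_rows,            # row-bind each worker's result
  .packages = c("haven", "dplyr")   # attach these packages on every worker
) %dopar% {
  read_sas(f)
}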
3. mclapply speedup
This suits a pipe-based workflow: the call slots into a pipeline and you just keep writing downstream.
library(parallel)  # mclapply()
library(haven)     # read_sas()
library(purrr)     # set_names()
library(stringr)   # str_glue()

# all_ds_name (dataset names) and folder_rawdata_snap (raw-data folder) are defined upstream
all_data <- mclapply(
  seq_along(all_ds_name) %>% set_names(all_ds_name),
  function(i) {
    read_sas(
      str_glue('{folder_rawdata_snap}/{all_ds_name[i]}.sas7bdat'),
      col_select = c(
        project, Site, StudySiteNumber, Subject, InstanceName, DataPageName,
        RecordPosition, MinCreated, instanceId, Folder
      )
    )
  },
  mc.cores = 8, mc.preschedule = FALSE
)
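Two caveats: mclapply parallelizes by forking, which is not supported on Windows (there mc.cores must be 1; use parLapply() or the foreach setup above instead). And because the input was named via set_names(), the pieces can be combined with dplyr::bind_rows(all_data, .id = "dataset") to record which file each row came from (the "dataset" column name is just an example).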
4. The purrr::map family
- map(.x, .f, ...): returns a list
- map_lgl(), map_int(), map_dbl(), map_chr(): return an atomic vector of the corresponding type; when using map_int(), watch out for automatic type promotion of the results
- map_dfc(), map_dfr(): column-bind / row-bind the results into a data frame
library(dplyr)
library(purrr)

# one regression per cyl group, then row-bind the coefficient rows
mtcars %>%
  split(.$cyl) %>%
  map(~ lm(mpg ~ wt, data = .x)) %>%
  map_dfr(~ as.data.frame(t(as.matrix(coef(.)))))
# output:
  (Intercept)        wt
1    39.57120 -5.647025
2    28.40884 -2.780106
3    23.86803 -2.192438
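Applied to the merging task, map_dfr() collapses the whole job into a short pipe; again a sketch over the hypothetical `files` vector from above ("source_file" is just an illustrative column name):

library(haven)
library(purrr)

merged <- files %>%
  set_names() %>%                          # file paths become names, used as .id below
  map_dfr(read_sas, .id = "source_file")   # read each file and row-bind, tagging rows by source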
5. Grouping and sorting
- group_by or arrange over many variables, with many groups, is very slow; dtplyr speeds it up
library(dplyr)
library(lubridate)  # mdy_hms(), as_date()
library(dtplyr)     # lazy_dt()

query_posting_delay_query_detail <- query_detail %>%
  select(Study, SiteName, StudyEnvironmentSiteNumber, SubjectName, Folder, Form, Field, `Log#`, MarkingGroupName, QryOpenDate, Name) %>%
  filter(!(Name %in% c('Cancelled'))) %>%
  filter(!(MarkingGroupName %in% c('Site from System'))) %>%
  mutate(QryOpenDate = as_date(mdy_hms(QryOpenDate))) %>%
  # use data.table for faster group summarize: 479.858 -> 0.112 seconds, under 95008 groups
  lazy_dt() %>%
  group_by(Study, SiteName, StudyEnvironmentSiteNumber, SubjectName, Folder, Form, Field, `Log#`, MarkingGroupName) %>%
  summarise(QryOpenDate = min(QryOpenDate, na.rm = TRUE)) %>%
  ungroup() %>%
  as_tibble()
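Design note: lazy_dt() computes nothing by itself; it records the subsequent dplyr verbs, translates them into a data.table expression, and only executes when the result is collected with as_tibble() (or as.data.frame() / as.data.table()). That is why the same group_by/summarise pipeline runs so much faster, as the timing comment above shows.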
Comments and discussion welcome~
Reference:
http://www.reibang.com/p/c498c9d4cfaf