????????SQL(Structured Query Language, 結(jié)構(gòu)化查詢語(yǔ)言)是用于訪問(wèn)和處理數(shù)據(jù)庫(kù)的標(biāo)準(zhǔn)的計(jì)算機(jī)語(yǔ)言不脯,也是數(shù)據(jù)清洗的神器朦拖。
????????日常的數(shù)據(jù)統(tǒng)計(jì)分析工作中让虐,80%的時(shí)間是在做數(shù)據(jù)清洗诅炉,只有20%的時(shí)間在優(yōu)化模型氓润、分析統(tǒng)計(jì)結(jié)果等赂乐。數(shù)據(jù)清洗的工作的重要性不言而喻,今天先簡(jiǎn)單介紹下數(shù)據(jù)清洗中最重要的工具--SQL咖气。
1. sqldf包簡(jiǎn)介
? ? 通過(guò)sqldf包挨措,可直接在R中的數(shù)據(jù)框(data.frame)(類似數(shù)據(jù)庫(kù)中的表)上進(jìn)行SQL操作,R中數(shù)據(jù)清洗常用的dplyr包的許多函數(shù)操作也是將相應(yīng)的命令轉(zhuǎn)化為SQL語(yǔ)句來(lái)執(zhí)行崩溪。sqldf包支持SQLite(默認(rèn))浅役, H2,MySQL及PostgreSQL作為后臺(tái)來(lái)執(zhí)行SQL語(yǔ)句伶唯。SQLite及H2是兩個(gè)無(wú)服務(wù)器端觉既,無(wú)需配置的輕量級(jí)數(shù)據(jù)庫(kù)管理系統(tǒng),在R中安裝好并加載sqldf包就可以直接使用SQLite數(shù)據(jù)庫(kù)來(lái)操作數(shù)據(jù)乳幸,H2同時(shí)還需要安裝并加載RH2包即可使用(MySQL和PostgreSQL需要繁瑣的服務(wù)器端客戶端的配置)瞪讼。SQLite靈活輕便,應(yīng)用非常廣泛粹断,集成到了許多IOS及Android的app中符欠。SQLite 沒(méi)有一個(gè)單獨(dú)的用于存儲(chǔ)日期和/或時(shí)間的存儲(chǔ)類,但 SQLite 能夠把日期和時(shí)間存儲(chǔ)為文本或數(shù)值類型(在用SQLite處理日期數(shù)據(jù)的時(shí)候要非常小心瓶埋,以后再細(xì)說(shuō))希柿。
2.R中的簡(jiǎn)單SQL語(yǔ)句
主要介紹使用SQLite為后臺(tái)诊沪,通過(guò)sqldf中執(zhí)行SQL語(yǔ)句
library(sqldf)
df2 <- sqldf('select * from df',drv = 'SQLite')
##這里的drv不寫默認(rèn)就是SQLite狡汉。選擇df數(shù)據(jù)庫(kù)所有變量(* 代表所有變量)
Examples
在R中不使用SQL與使用SQL比較
2.1 head
> df1 <- head(warpbreaks, 5)
> df2 <- sqldf('select * from warpbreaks limit 5;')
> identical(df1, df2)
[1] TRUE
2.2 subset
> data(farms, package = 'MASS')
> df1 <- subset(farms, Manag %in% c('SF', 'BF'))
> df2 <- sqldf("select * from farms where Manag in ('SF', 'BF')")
> row.names(df1) <- NULL
> identical(df1, df2)
[1] TRUE
> df1 <- subset(warpbreaks, breaks >= 20 & breaks <= 30)
> df2 <- sqldf('select * from warpbreaks where breaks between 20 and 30;')
> row.names(df1) <- NULL
> identical(df1, df2)
[1] TRUE
> df1 <- subset(farms, Mois == 'M1')
> df2 <- sqldf('select * from farms where Mois = "M1"', row.names = T)
> identical(df1, df2)
[1] TRUE
2.3 rbind
> df1 <- subset(farms, Mois == 'M1')
> df2 <- subset(farms, Mois == 'M2')
> df3 <- sqldf('select * from farms where Mois = "M1"', row.names = T)
> df4 <- sqldf('select * from farms where Mois = "M2"', row.names = T)
> df12 <- rbind(df1, df2)
> df34 <- sqldf('select * from df3 union all select * from df4', row.names = T)
> identical(df12, df34)
[1] TRUE
2.4 aggregate
> df1 <- aggregate(iris[1:2], iris[5], mean)
> df2 <- sqldf('select Species, avg("Sepal.Length") as "Sepal.Length",?
+? ? ? ? ? ? ? avg("Sepal.Width") as "Sepal.Width" from iris group by Species;')
> all.equal(df1, df2)
[1] TRUE
2.5 order?
> df1 <- head(warpbreaks[order(warpbreaks$breaks, decreasing = T), ], 3)
> df2 <- sqldf('select * from warpbreaks order by breaks desc limit 3;')
> row.names(df1) <- NULL
> identical(df1, df2)
[1] TRUE
今天先到這里娄徊,有空再寫~~~