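Even when I feel lost, I shouldn't forget to keep learning new skills and to summarize what I have learned. I recently started working through the Chinese edition of "R for Data Science". It really is a good book: the way it layers and explains some topics cleared up many parts of R that I had only half understood. I am writing this series of notes to build up my own body of knowledge, and to share and discuss it with others.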
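Chapter 1: Data visualization with ggplot2
- `ggplot()`: mappings added in this function are global, so they apply to every geom in the plot.
- Layers: `geom_point()` adds a layer of points.
- Mapping data to aesthetics with `mapping = aes()`: to map an aesthetic to a variable, pair the name of the aesthetic with the name of the variable inside `aes()`. Scaling: ggplot2 assigns a unique level of the aesthetic to each unique value of the variable.
- Setting an aesthetic manually: put it outside `aes()`, as an argument of `geom_point()`. The color then conveys no information about a variable.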
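- Faceting: `facet_grid()` can facet the plot by two variables, `facet_grid(drv ~ cyl)`, or by one, `facet_grid(. ~ cyl)`.
- Grouping: `aes(group = ...)` groups the data by a variable without adding a legend or any distinguishing feature to the geoms.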
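To make the two facet formulas concrete, here is a minimal sketch using the `mpg` data that ships with ggplot2. The object names `p`, `p_grid` and `p_cols` are my own, chosen only for illustration.

```r
library(ggplot2)

# Base scatterplot: engine displacement against highway mileage.
p <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point()

# Rows are split by drv, columns by cyl.
p_grid <- p + facet_grid(drv ~ cyl)

# A dot on the left-hand side means "no row variable": one row, columns by cyl.
p_cols <- p + facet_grid(. ~ cyl)

print(p_grid)
```

`facet_grid()` draws a panel for every combination of the two variables, including combinations that have no data, which is why some panels come out empty.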
- Statistical transformations: the algorithm used to calculate new values for a plot is called a stat (statistical transformation). For example, `geom_bar()` maps only one variable, x, by default; its stat then computes the count of each value of x.
- Every geom has a default stat, and every stat has a default geom.
- To plot two variables, with the bar heights read directly from a y column, use `geom_bar(mapping = aes(x = a, y = b), stat = "identity")`.
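The difference between the default count stat and `stat = "identity"` is easiest to see on a tiny data frame. The sketch below uses made-up data (`sales`, `product`, `units` are hypothetical names) and `layer_data()`, which returns the data a layer actually draws.

```r
library(ggplot2)

# Hypothetical pre-summarised data: one row per bar, height already known.
sales <- data.frame(
  product = c("a", "b", "c"),
  units   = c(12, 30, 7)
)

# Default stat ("count"): each product occurs once, so every bar has height 1.
p_count <- ggplot(sales) +
  geom_bar(mapping = aes(x = product))

# stat = "identity": bar heights are read from the units column.
p_identity <- ggplot(sales) +
  geom_bar(mapping = aes(x = product, y = units), stat = "identity")

layer_data(p_count)$count   # 1 1 1
layer_data(p_identity)$y    # 12 30 7
```

`geom_col()` is a shortcut for `geom_bar(stat = "identity")`, so it would work here as well.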
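- Aesthetics and position adjustments:
  - `color`, `fill`
  - The `position` argument has three options: "identity", "fill", "dodge".
  - `position = "dodge"` shows the data by group, placing the bars of each group side by side, which makes it easy to compare the value each bar represents.
  - In a scatterplot the clustering of the data can be hard to judge because of overplotting (many points that are very close to each other overlap). For `geom_point()`, `position = "jitter"` adds a small amount of random noise to each point, which spreads the points out and relieves the overplotting.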
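As a quick comparison of these position adjustments, here is a sketch on the `diamonds` and `mpg` data from ggplot2. The object names are mine, and `set.seed(1)` is only there so the jitter is repeatable.

```r
library(ggplot2)

# Default for bars is stacking: clarity segments sit on top of each other.
p_stack <- ggplot(diamonds) +
  geom_bar(aes(x = cut, fill = clarity))

# "dodge": one bar per clarity level, side by side within each cut.
p_dodge <- ggplot(diamonds) +
  geom_bar(aes(x = cut, fill = clarity), position = "dodge")

# "fill": every stack is rescaled to height 1, so the bars show proportions.
p_fill <- ggplot(diamonds) +
  geom_bar(aes(x = cut, fill = clarity), position = "fill")

# "jitter": each point is moved by a small random amount.
set.seed(1)
p_jitter <- ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy), position = "jitter")

print(p_dodge)
```

`geom_jitter()` is shorthand for `geom_point(position = "jitter")`.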
- Coordinate systems:
  - `coord_flip()` swaps the x and y axes.
  - `labs()` modifies axis, legend, and plot labels.
library(tidyverse) ## loads ggplot2 (with the mpg and diamonds data) and dplyr
mpg
str(mpg)
data <- mpg
?mpg ## view the documentation for the mpg data set
ggplot(data = mpg)+geom_point(aes(x=displ,y=hwy))
ggplot(mpg)+geom_point(mapping = aes(x=displ,y=hwy,color=class),color="#EF5C4E",shape=19) ## the color set outside aes() overrides the color=class mapping
ggplot(mpg)+geom_point(mapping = aes(x=displ,y=hwy),color="#EF5C4E",shape=19)
ggplot(mpg)+geom_point(mapping = aes(x=displ,y=hwy,stroke=displ),shape=19)
## add two layers: geom_point() and geom_smooth()
ggplot(mpg)+geom_point(mapping = aes(x=displ,y=hwy,color=drv))+geom_smooth(mapping = aes(x=displ,y=hwy,linetype=drv,color=drv))
# add grouping
ggplot(data = mpg)+geom_smooth(mapping = aes(x=displ,y=hwy,group=drv))
ggplot(data = mpg)+geom_smooth(mapping = aes(x=displ,y=hwy,color=drv),show.legend = F) ## show.legend=F hides the legend
## use different data in different layers
## data=filter(mpg,class=="suv") restricts the smooth line to SUVs; se=F removes the confidence band around the line.
ggplot(data = mpg,mapping = aes(x=displ,y=hwy))+geom_point(mapping = aes(color=class))+geom_smooth(data = filter(.data = mpg,class=="suv"),se = F)
## exercises
ggplot(data = mpg,mapping = aes(x=displ,y=hwy))+geom_point()+geom_smooth(se = F)
ggplot(data = mpg,mapping = aes(x=displ,y=hwy))+geom_point()+geom_smooth(se = F,mapping = aes(group=drv))
ggplot(data = mpg,mapping = aes(x=displ,y=hwy,color=drv))+geom_point()+geom_smooth(se = F)
ggplot(data = mpg,mapping = aes(x=displ,y=hwy))+geom_point(mapping = aes(color=drv))+geom_smooth(se = F)
ggplot(data = mpg,mapping = aes(x=displ,y=hwy))+geom_point(mapping = aes(color=drv))+geom_smooth(mapping = aes(linetype=drv),se = F)
ggplot(data = mpg,mapping = aes(x=displ,y=hwy))+geom_point(mapping = aes(color=drv))
### statistical transformations
ggplot(data = mpg,mapping = aes(x=displ,y=hwy))+geom_point()+geom_smooth(se = F)
ggplot(data=diamonds)+stat_summary(mapping = aes(x=cut,y=depth))
ggplot(data=diamonds)+geom_boxplot(mapping = aes(x=cut,y=price))
ggplot(data=diamonds)+geom_bar(mapping = aes(x=cut))
ggplot(data=diamonds)+geom_bar(mapping = aes(x=cut,y=..prop..,group=1)) ## group=1 inside aes() so proportions are computed over the whole data set
### aesthetic and position adjustments
ggplot(diamonds)+geom_bar(mapping = aes(x=cut,fill=cut),color="black")+scale_fill_brewer(palette = "Set3")
ggplot(diamonds)+geom_bar(mapping = aes(x=cut,fill=clarity))+scale_fill_brewer(palette = "Set2")
ggplot(diamonds)+geom_bar(mapping=aes(x=cut,fill=clarity),position = "dodge")+scale_fill_brewer(palette = "Set2") ## map fill (not color) so that the fill scale applies
ggplot(mpg)+geom_point(mapping = aes(x=displ,y=hwy),position = "jitter")
##exercises
ggplot(mpg,mapping = aes(x=cty,y=hwy))+geom_point(position = "jitter")+geom_smooth(color="black")
ggplot(mpg,mapping = aes(x=cty,y=hwy))+geom_jitter()
ggplot(mpg,mapping = aes(x=cty,y=hwy))+geom_count()
ggplot(mpg)+geom_boxplot(mapping = aes(x=manufacturer,y=hwy),position = "identity")
?geom_boxplot
### 1.9 coordinate systems
ggplot(mpg,mapping = aes(x=class,y=hwy))+geom_boxplot()+coord_flip()
nz <- map_data("nz")
?map_data
bar <- ggplot(data=diamonds)+geom_bar(mapping = aes(x=cut,fill=cut),show.legend = FALSE)+theme(aspect.ratio = 1)+labs() ## save the plot as bar so it can be reused below
bar+scale_fill_brewer(palette = "Set2")
bar+coord_flip()
bar+coord_polar()
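Chapter 2: Workflow: basics
- Assignment: a handy shortcut is Alt+minus, which inserts the assignment operator <- with a space on each side.
- Objects: name them in snake_case, with lowercase words separated by _.
- In RStudio, Alt+Shift+K shows the keyboard shortcut reference.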
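Chapter 3: Data transformation with dplyr
- `tibble`: a special kind of data.frame.
- Variable types:
  - int: integers
  - dbl (shown as num by `str()`): double-precision floating-point numbers, i.e. real numbers
  - chr: character vectors / strings
  - dttm: date-times (a date plus a time)
  - lgl: logical values
  - fctr (factor): factors
  - date: dates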
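To see these type abbreviations in practice, here is a small sketch that builds a tibble with one column of each type. The tibble `trips` and all of its values are made up for illustration.

```r
library(tibble)

# A made-up tibble with one column per type listed above.
trips <- tibble(
  id       = 1:3,                                   # int
  distance = c(1.5, 20, 7.25),                      # dbl
  carrier  = c("UA", "AA", "DL"),                   # chr
  departed = as.POSIXct("2013-01-01 05:15:00", tz = "UTC") + c(0, 3600, 7200), # dttm
  on_time  = c(TRUE, FALSE, NA),                    # lgl
  size     = factor(c("small", "large", "small")),  # factor
  day      = as.Date("2013-01-01") + 0:2            # date
)

trips                                        # printing shows the type under each column name
sapply(trips, function(col) class(col)[1])   # the underlying R class of each column
```

Recent versions of tibble print the factor abbreviation as `<fct>` rather than `fctr`.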
- Basic functions: `filter()`, `arrange()`, `select()`, `mutate()`, `summarize()`, `group_by()`
- Filtering rows with `filter()`:
  - `filter(flights, arr_delay <= 10)`
  - Comparison operators: `==`, `!=`, `<=`
  - Logical operators: `x & !y`, `x | y`, `xor(x, y)`
  - Missing values: `NA`, `is.na()`
## filter()
library(nycflights13) ## provides the flights data set
library(tidyverse)
(jan1 <- filter(flights,month==1,day==1))
(dec25 <- filter(flights,month==12,day==25))
filter(flights,month>=11)
(nov_dec <- filter(flights,month %in% c(11,12)))
filter(flights,!(arr_delay<=120 | dep_delay<=120))
NA>=10
x <- NA
is.na(x)
df <- tibble(x=c(1,NA,2))
filter(df,x>1)
filter(df,is.na(x)|x>1)
### exercise
filter(flights,carrier %in% c("UA","AA","DL"))
filter(flights,month %in% c(7,8,9))
filter(flights,arr_delay>120 & dep_delay==0)
filter(flights,dep_delay >= 60 & (dep_delay-arr_delay>=30))
filter(flights,dep_time ==2400| dep_time<=600)
filter(flights,is.na(dep_time))
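- Sorting rows by column (variable) values with `arrange()`:
  - `desc()` sorts in descending order.
  - Missing values are always sorted last; to bring them to the front, sort by `desc(is.na(x))` first.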
# arrange()
arrange(flights,desc(dep_delay),desc(arr_delay)) # descending order; desc() takes one column at a time
arrange(flights,desc(is.na(dep_delay)),arr_delay) ## put rows with NA in dep_delay first
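The NA behaviour is easier to check on a three-row tibble than on `flights`. This is a minimal sketch with made-up data (`df`, `x`, `y` are illustrative names).

```r
library(dplyr)

df <- tibble(x = c(5, 2, NA), y = c("a", "b", "c"))

arrange(df, x)                  # ascending: b, a, then the NA row (c) last
arrange(df, desc(x))            # descending: a, b, and the NA row (c) is still last
arrange(df, desc(is.na(x)), x)  # is.na(x) is TRUE only for the NA row, so it comes first: c, b, a
```

Sorting by `desc(is.na(x))` works because `TRUE` sorts after `FALSE`, so reversing that order moves the missing rows to the top.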
- Selecting columns with `select()` (a data set can have thousands of variables; `select()` picks out a subset of them):
  - Select the columns from year to day: `select(flights, year:day)`
  - Select everything except the columns from year to day: `select(flights, -(year:day))`
  - Match variable names: `starts_with("")`, `ends_with("")`, `matches("")`
  - `rename()` renames columns.
  - `everything()` is a helper for moving a few columns to the front.
## select()
select(flights,year:day)
select(flights,-(year:day)) ## everything except year:day
select(flights,starts_with("s"))
select(flights,ends_with("e"))
select(flights,matches("time"))
select(flights,matches("(.)\\1"))
rename(flights,tail_num=tailnum) ## rename a variable (new_name = old_name)
select(flights,-(month:day),everything()) ## with the everything() helper, columns can be moved to the front or, as here, to the back
select(flights, hour:time_hour,everything())
###exercise
select(flights,year,year,year)
select(flights,one_of(c("year","month","day","dep_delay")))
select(flights,contains("TIME"))
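To check what each helper returns without scrolling through `flights`, here is a sketch on a one-row tibble. The tibble `df` is made up; its column names merely imitate a few of those in `flights`.

```r
library(dplyr)

df <- tibble(year = 2013, month = 1, day = 1,
             dep_time = 517, arr_time = 830, carrier = "UA")

names(select(df, year:day))               # "year" "month" "day"
names(select(df, -(year:day)))            # "dep_time" "arr_time" "carrier"
names(select(df, ends_with("time")))      # "dep_time" "arr_time"
names(select(df, contains("TIME")))       # helpers ignore case by default: "dep_time" "arr_time"
names(select(df, carrier, everything()))  # carrier first, then the rest in their original order
names(rename(df, airline = carrier))      # same columns, carrier renamed to airline
```

The `contains("TIME")` result explains the exercise above: the select helpers have `ignore.case = TRUE` by default.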
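- Adding new columns/variables with `mutate()`:
  - `mutate()` adds the new columns after the existing ones;
  - `transmute()` keeps only the new variables.
  - Useful operators and functions: integer division `%/%`, remainder `%%`, the offset functions `lead()` and `lag()`, cumulative and rolling aggregates, logical comparisons, and ranking.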
# mutate(): add new variables/columns at the end of a tibble
flights_sml <- select(flights,year:day,matches("_delay"),distance,air_time)
flights_sml
mutate(flights_sml,flying_delay=arr_delay-dep_delay,speed=distance/air_time * 60 )
flights_sml
transmute(flights,gain=arr_delay-dep_delay,hour=air_time/60,gain_per_hour=gain/hour)
mutate(flights,dep_time=((dep_time%/%100 * 60)+(dep_time%%100))) ## returns a copy with dep_time converted to minutes since midnight; flights itself is unchanged unless the result is assigned
flights
transmute(flights,air_time,duration=(arr_time-dep_time),arr_delay)
1:3+1:10
1:10+1:3
1:10
?cos
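The operators listed for `mutate()` can be tried on a few clock times in HHMM format, as used by `dep_time`. This is a sketch with made-up values; `departures` and the new column names are illustrative only.

```r
library(dplyr)

departures <- tibble(dep_time = c(517, 1345, 2359))

result <- mutate(
  departures,
  hour      = dep_time %/% 100,     # integer division: 517 %/% 100 is 5
  minute    = dep_time %% 100,      # remainder: 517 %% 100 is 17
  minutes_since_midnight = hour * 60 + minute,
  previous  = lag(dep_time),        # value from the row above (NA in the first row)
  following = lead(dep_time),       # value from the row below (NA in the last row)
  running_minutes = cumsum(minute)  # cumulative sum
)

result
departures   # still has only the dep_time column: mutate() returns a new tibble
```

Columns created earlier in the same `mutate()` call (`hour`, `minute`) can be used by later ones, which is why `minutes_since_midnight` works here.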
- Grouped summaries with `summarize()`:
  - Used together with `group_by()`, it changes the unit of analysis from the whole data set to individual groups.
# grouped summaries with summarize()
by_year <- group_by(flights,year,month)
summarise(by_year,delay=mean(arr_delay-dep_delay,na.rm = T))
#### plot the result
(delay_byDay <- group_by(flights,month) %>%summarise(delay_time=mean(dep_delay,na.rm = T))) %>% ggplot(mapping = aes(x=month,y=delay_time))+geom_point()+geom_smooth(se=F)
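- Combining operations on the data with the pipe `%>%`:
  - The overall pattern is `flights %>% group_by(~) %>% summarize(mean(~~,na.rm=T)) %>% filter(~) %>% ggplot(aes())+geom_~()`
  - Missing values: an aggregation over data that contains a missing value returns a missing value, so use `na.rm=T`, or drop those rows first with `filter(!is.na(dep_delay),!is.na(arr_delay))`.
  - Common summary functions: `n()` / `sum()` / `mean()`
  - Median `median()`; measures of spread `sd()` / `IQR()` / `mad()`
  - Counts: `n()`; `n_distinct()` counts the number of unique values after removing duplicates; `count()` is a quick shortcut for counting.
  - Counts and proportions of logical values: `summarize(n_early=sum(dep_time<50))`; `sum()` gives the number of TRUE values, that is, how many rows satisfy the condition, and `mean()` gives their proportion.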
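The summary functions above, and the effect of `na.rm`, can be checked by hand on a five-row tibble. This is a sketch with made-up data; `flights_small` only imitates three columns of `flights`.

```r
library(dplyr)

flights_small <- tibble(
  carrier   = c("UA", "UA", "AA", "AA", "AA"),
  dest      = c("ORD", "SFO", "ORD", "ORD", "MIA"),
  dep_delay = c(10, NA, -5, 30, 0)
)

by_carrier <- flights_small %>%
  group_by(carrier) %>%
  summarise(
    flights    = n(),                              # rows per group
    n_dest     = n_distinct(dest),                 # unique destinations per group
    mean_raw   = mean(dep_delay),                  # NA for UA, which has a missing delay
    mean_delay = mean(dep_delay, na.rm = TRUE),    # missing values dropped first
    n_late     = sum(dep_delay > 0, na.rm = TRUE), # number of TRUE values
    share_late = mean(dep_delay > 0, na.rm = TRUE) # proportion of TRUE values
  )

by_carrier
count(flights_small, carrier)   # shortcut for group_by(carrier) %>% summarise(n = n())
```

For AA this gives 3 flights, 2 destinations and 1 late departure out of 3; for UA the plain mean is `NA` while the `na.rm = TRUE` mean is 10.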
### combine several operations with the pipe
(delay_by_dest <- group_by(flights,dest)%>%summarise(count=n(),delay_time=mean(arr_delay,na.rm = T), dist=mean(distance,na.rm = T))) %>% filter(count>20,dest!="HNL") %>% ggplot(mapping = aes(x=dist,y=delay_time))+geom_point(aes(size=count))+geom_smooth(se=F,color="darkblue") ## averages arr_delay (a delay), not dep_time (a clock time)
## the same analysis step by step, without the pipe %>%, for comparison
by_dest <- group_by(flights,dest)
(delay <- summarise(by_dest,count=n(),dist=mean(distance,na.rm = T),delay=mean(arr_delay,na.rm = T))) ### count=n() counts the rows in each group, i.e. the number of flights to each dest
delay <- filter(delay,count>20,dest!="HNL") ## drop destinations with few flights and the outlier airport HNL
ggplot(data = delay,mapping = aes(x=dist,y=delay))+geom_point(aes(size=count),alpha=1/3)+geom_smooth(se=F,color="darkblue")
## relationship between individual planes (tail number) and delay
flights %>% group_by(tailnum) %>%summarise(count=n(),delay_time=mean(arr_delay,na.rm = T)) %>%arrange(delay_time) %>%ggplot(mapping = aes(x=delay_time))+geom_freqpoly(binwidth = 10)
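## relationship between the number of flights and the average delay: when a plane has few flights, its average delay varies a lot
flights %>% group_by(tailnum) %>% summarise(count=n(),delay_time=mean(arr_delay,na.rm = T)) %>% filter(count>25) %>% ggplot(mapping = aes(x=count,y=delay_time))+geom_point(alpha=1/5)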
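## other useful summary functions
flights_not_cancelled <- flights %>% filter(!is.na(dep_delay),!is.na(arr_delay)) ## drop cancelled flights (missing delays)
flights_not_cancelled %>% group_by(dest) %>% summarise(count=n())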
flights_not_cancelled %>% group_by(dest) %>% summarise(carriers=n_distinct(carrier))
flights_not_cancelled %>% group_by(tailnum) %>% summarise(sum(distance))
###exercises
##### which carrier has the longest average delay
flights_not_cancelled %>% group_by(carrier) %>% summarise(count=n(),arr_delay_time=mean(arr_delay)) %>% arrange(desc(arr_delay_time)) %>% ggplot(mapping = aes(x=carrier,y=arr_delay_time))+geom_point(aes(size=count))
- Removing grouping: the `ungroup()` function.
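A minimal sketch of what `ungroup()` changes, on made-up data (`df`, `g`, `x` are illustrative names):

```r
library(dplyr)

df <- tibble(g = c("a", "a", "b"), x = c(1, 2, 10))

grouped <- group_by(df, g)

per_group <- summarise(grouped, total = sum(x))           # one row per group: 3 and 10
overall   <- summarise(ungroup(grouped), total = sum(x))  # grouping removed: a single row, 13

group_vars(grouped)            # "g"
group_vars(ungroup(grouped))   # character(0): no grouping left
```

After `ungroup()`, later verbs such as `summarise()` and `mutate()` operate on the whole data set again.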