2020-04-25
1.1. First step: a ggplot drawing is built from layers stacked on one another, and each command adds one layer
ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy))
ggplot() creates a coordinate system that you can add layers to. The first argument of ggplot() is the dataset to use in the graph, so ggplot(data = mpg) creates an empty graph.
1.2. The function geom_point() adds a layer of points to your plot, which creates a scatterplot. The mapping argument is always paired with aes(), and the x and y arguments of aes() specify which variables to map to the x and y axes.
Terminology: ggplot() and geom_point() are functions; mapping is an argument.
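The layering idea can be checked directly. The following is a minimal sketch, assuming the ggplot2 package is installed; inspecting p$layers is only an illustrative way to count layers and is not something these notes otherwise rely on.

```r
library(ggplot2)

# An empty plot: a coordinate system with no layers yet
p0 <- ggplot(data = mpg)
n_before <- length(p0$layers)

# Each geom added with + contributes one layer
p1 <- p0 + geom_point(mapping = aes(x = displ, y = hwy))
n_after <- length(p1$layers)

cat("layers before:", n_before, "after:", n_after, "\n")
```

Adding a second geom to p1 in the same way would raise the layer count again.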
Mapping one more variable to an aesthetic:
ggplot(data=iris)+geom_point(mapping = aes(x=Sepal.Width,y=Sepal.Length,color=Species))
ggplot(data=iris)+geom_point(mapping = aes(x=Species,y=Sepal.Length,color=Sepal.Width))
Several aesthetics can be combined in one call:
ggplot(data=iris)+geom_point(mapping = aes(x=Sepal.Width,y=Sepal.Length,size=Species,color=Species))
Warning message:
Using size for a discrete variable is not advised.
1.3. Aesthetic properties can also be set manually
ggplot(data=iris) + geom_point(mapping = aes(x=Sepal.Width,y=Sepal.Length,color="grey"))
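Here color is set inside aes(), which means the string "grey" is mapped to color as if it were a variable. The points are therefore not drawn in grey; the default color scale picks a color for the single value "grey" and adds a legend.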
ggplot(data=iris) + geom_point(mapping = aes(x=Sepal.Width,y=Sepal.Length),color="grey")
Here color is set outside aes(). It does not change any variable mapping; it only changes the appearance of the geom_point() scatterplot.
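The difference between the two calls can be made visible without looking at the plot. This is a small sketch, assuming ggplot2 is installed; it uses layer_data() to read back the colours that each plot would actually draw, and the variable names are illustrative.

```r
library(ggplot2)

# color inside aes(): "grey" is treated as data (a constant variable)
p_inside <- ggplot(data = iris) +
  geom_point(mapping = aes(x = Sepal.Width, y = Sepal.Length, color = "grey"))

# color outside aes(): "grey" is a fixed appearance setting
p_outside <- ggplot(data = iris) +
  geom_point(mapping = aes(x = Sepal.Width, y = Sepal.Length), color = "grey")

# layer_data() returns the values used for drawing each layer
colour_inside <- unique(layer_data(p_inside)$colour)
colour_outside <- unique(layer_data(p_outside)$colour)

print(colour_inside)   # a colour chosen by the default scale, not "grey"
print(colour_outside)  # the literal setting
```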
One common problem when creating ggplot2 graphics is to put the + in the wrong place: it has to come at the end of the line, not the start. In other words, make sure you haven't accidentally written code like this:
ggplot(data = mpg)
+ geom_point(mapping = aes(x = displ, y = hwy))
1.4. Faceting
Note: the facet function is added to the plot with +, at the same level as the geom layer; it does not go inside aes().
ggplot(data=iris)+geom_point(mapping = aes(x=Sepal.Width,y=Sepal.Length))+facet_wrap(~Species,nrow=2)
Note: Species is a discrete variable. Faceting on the continuous variable Sepal.Width instead gives one panel per distinct value:
> ggplot(data=iris)+geom_point(mapping = aes(x=Sepal.Width,y=Sepal.Length))+facet_wrap(~Sepal.Width,nrow=4)
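One possible way to keep the number of panels manageable is to bin the continuous variable before faceting. The sketch below assumes ggplot2 is installed and uses its cut_width() helper; the bin width of 0.5 is an arbitrary choice for illustration.

```r
library(ggplot2)

# Faceting on the raw continuous variable: one panel per distinct value
n_raw <- length(unique(iris$Sepal.Width))

# Binning first (bin width 0.5 is arbitrary) gives only a few panels
p_binned <- ggplot(data = iris) +
  geom_point(mapping = aes(x = Sepal.Width, y = Sepal.Length)) +
  facet_wrap(~ cut_width(Sepal.Width, 0.5), nrow = 2)

n_binned <- length(unique(layer_data(p_binned)$PANEL))
cat("panels without binning:", n_raw, "with binning:", n_binned, "\n")
```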
Summarizing the iris data:
> p<-iris
> distinct(p,iris)
> distinct(p,Sepal.Length) # show the distinct values
Sepal.Length
1 5.1
2 4.9
3 4.7
4 4.6
5 5.0
6 5.4
7 4.4
8 4.8
9 4.3
10 5.8
11 5.7
12 5.2
13 5.5
14 4.5
15 5.3
16 7.0
17 6.4
18 6.9
19 6.5
20 6.3
21 6.6
22 5.9
23 6.0
24 6.1
25 5.6
26 6.7
27 6.2
28 6.8
29 7.1
30 7.6
31 7.3
32 7.2
33 7.7
34 7.4
35 7.9
> count(p,Sepal.Length) # count how often each distinct value occurs
# A tibble: 35 x 2
Sepal.Length n
<dbl> <int>
1 4.3 1
2 4.4 3
3 4.5 1
4 4.6 4
5 4.7 2
6 4.8 5
7 4.9 6
8 5 10
9 5.1 9
10 5.2 4
# … with 25 more rows
1.5. Comparing facet_grid() layouts: the variable with more unique values should generally go in the columns
ggplot(data=mpg)+
geom_point(mapping = aes(drv,y=cyl))
> ggplot(data=mpg)+
+ geom_point(mapping = aes(drv,y=cyl))+
+ facet_grid(drv~cyl)
> ggplot(data=mpg)+
+ geom_point(mapping = aes(drv,y=cyl))+
+ facet_grid(cyl~drv)
> ggplot(data=mpg)+
+ geom_point(mapping = aes(drv,y=cyl))+
+ facet_grid(drv~.)
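- About stroke
> ggplot(data=iris)+
+ geom_point(mapping = aes(x=Sepal.Length,y=Sepal.Width,color=Species),shape=21,stroke=1,fill="lightpink")
Zooming in shows that the inside of each outlined point is filled with lightpink. stroke and fill are constants here, so they are set outside aes().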
1.6. Geometric objects
> ggplot(data=iris)+
+ geom_smooth(mapping = aes(x=Sepal.Length,y=Sepal.Width,color=Species))
`geom_smooth()` using method = 'loess' and formula 'y ~ x'
When several layers share the same mappings, the mappings can be defined once in ggplot():
> ggplot(data = iris, mapping = aes(x=Sepal.Length,y=Sepal.Width))+
+ geom_point()+
+ geom_smooth()
(The most basic form repeats the mapping in every layer:)
> ggplot(data = iris)+
+ geom_point(mapping = aes(x=Sepal.Length,y=Sepal.Width))+
+ geom_smooth(mapping = aes(x=Sepal.Length,y=Sepal.Width))
A mapping can also be applied to a single layer only:
> ggplot(data = iris, mapping = aes(x=Sepal.Length,y=Sepal.Width))+
+ geom_point(mapping = aes(color=Species))+
+ geom_smooth()
In the same way, different layers can use different data: local settings override global ones.
> ggplot(data = iris, mapping = aes(x=Sepal.Length,y=Sepal.Width))+
+ geom_point(mapping = aes(color=Species),show.legend = F)+
+ geom_smooth(data=filter(iris,Species=="setosa"))
Exercise
p1 <- ggplot(data = mpg, mapping = aes(displ, hwy)) +
  geom_point(size = 2.5) +
  geom_smooth(se = FALSE, size = 1.5)
p2 <- ggplot(data = mpg, mapping = aes(displ, hwy)) +
  geom_point(size = 2.5) +
  geom_smooth(se = FALSE, size = 1.5, mapping = aes(group = drv))
p3 <- ggplot(data = mpg, mapping = aes(displ, hwy, color = drv)) +
  geom_point(size = 2.5) +
  geom_smooth(se = FALSE, size = 1.5, mapping = aes(group = drv, color = drv))
p4 <- ggplot(data = mpg, mapping = aes(displ, hwy)) +
  geom_point(size = 2.5, mapping = aes(color = drv)) +
  geom_smooth(se = FALSE, size = 1.5)
p5 <- ggplot(data = mpg, mapping = aes(displ, hwy)) +
  geom_point(size = 2.5, mapping = aes(color = drv)) +
  geom_smooth(se = FALSE, size = 1.5, mapping = aes(group = drv, linetype = drv))
p6 <- ggplot(data = mpg, mapping = aes(displ, hwy)) +
  geom_point(size = 2.5, mapping = aes(color = drv))
library(gridExtra) # arrange several plots on one page
grid.arrange(p1, p2, p3, p4, p5, p6, ncol = 2, nrow = 3)
1.7. Statistical transformations
geom_bar
view(diamonds)
The default stat of geom_bar is stat_count. stat_count computes two new variables: count and prop (proportion).
By default the y axis of a bar chart shows the count at each x value. In this example the x axis holds the five levels of cut (cut quality), and the bar chart counts the diamonds in each level automatically. To show the share of each quality level instead of the count, use prop:
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, y = ..prop.., group = 1))
group = 1 treats all diamonds as a single group, so the bars show the share of each of the five cut levels in the whole. Without it, each cut level forms its own group, and every proportion is 100%.
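The effect of group = 1 can be read off the computed values. This is a minimal sketch, assuming ggplot2 3.3.0 or later is installed: it uses after_stat(prop), the newer spelling of ..prop.., and layer_data() to extract what stat_count computed. The variable names are illustrative.

```r
library(ggplot2)

# after_stat(prop) is the newer spelling of ..prop.. (ggplot2 >= 3.3.0)
p_grouped <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, y = after_stat(prop), group = 1))
p_ungrouped <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, y = after_stat(prop)))

# One row per bar; prop is the proportion within the bar's group
prop_grouped <- layer_data(p_grouped)$prop
prop_ungrouped <- layer_data(p_ungrouped)$prop

print(round(prop_grouped, 3))  # shares of the five cut levels
print(prop_ungrouped)          # every level is 100% of its own group
```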
> ggplot(data = diamonds) +
+ stat_summary(
+ mapping = aes(x = cut, y = depth),
+ fun.min = min,
+ fun.max = max,
+ fun = median
+ )
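stat_summary(
mapping = NULL,
data = NULL,
geom = "pointrange", # default geom of stat_summary
position = "identity",
... # remaining arguments omitted
)
The pairing is not symmetric: the default stat of geom_pointrange is "identity", not "summary".
So to draw the stat_summary plot with the geom function instead of the stat function, the stat has to be given explicitly: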
ggplot(data = diamonds) +
geom_pointrange(
mapping = aes(x = cut, y = depth),
stat = "summary"
)
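(This plot does not set the min/max range, so no error bar from minimum to maximum is drawn; with no summary functions given, the summary stat falls back to mean_se().)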
geom_col is for the most common kind of bar chart, where both x and y are mapped (x is usually a factor, which is why the result is bars rather than a curve).
For example: ggplot(data, aes(x = x, y = y)) +
geom_col()
geom_bar is for bar charts of counts: only x is mapped (again usually a factor). It counts the observations at each level of x and puts that count on the y axis.
The difference is whether a y value is mapped.
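The relationship between the two geoms can be sketched as follows, assuming ggplot2 is installed. The counts are precomputed with base R's table(); the data frame cut_counts and its Freq column are names introduced here for illustration only.

```r
library(ggplot2)

# geom_bar(): only x is mapped, the counts come from stat_count()
p_bar <- ggplot(diamonds, aes(x = cut)) +
  geom_bar()

# geom_col(): the counts are computed beforehand and mapped to y
cut_counts <- as.data.frame(table(cut = diamonds$cut))
p_col <- ggplot(cut_counts, aes(x = cut, y = Freq)) +
  geom_col()

# Bar heights as drawn by each plot
heights_bar <- as.numeric(layer_data(p_bar)$y)
heights_col <- as.numeric(layer_data(p_col)$y)
same_heights <- isTRUE(all.equal(heights_bar, heights_col))
print(same_heights)
```

Both plots should therefore look the same apart from the y axis label.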
Most geoms and stats come in pairs that are almost always used in concert. Read through the documentation and make a list of all the pairs. What do they have in common?
Complementary geoms and stats
geom | stat |
---|---|
geom_bar() | stat_count() |
geom_bin2d() | stat_bin_2d() |
geom_boxplot() | stat_boxplot() |
geom_contour() | stat_contour() |
geom_count() | stat_sum() |
geom_density() | stat_density() |
geom_density_2d() | stat_density_2d() |
geom_hex() | stat_bin_hex() |
geom_freqpoly() | stat_bin() |
geom_histogram() | stat_bin() |
geom_qq_line() | stat_qq_line() |
geom_qq() | stat_qq() |
geom_quantile() | stat_quantile() |
geom_smooth() | stat_smooth() |
geom_violin() | stat_ydensity() |
geom_sf() | stat_sf() |
geom_pointrange() | stat_identity() |
They tend to share a name, for example stat_smooth() and geom_smooth(). However, this is not always the case: geom_bar() with stat_count() and geom_histogram() with stat_bin() are notable counter-examples.
If you want the heights of the bars to represent values in the data, use geom_col() instead. geom_bar() uses stat_count() by default: it counts the number of cases at each x position. geom_col() uses stat_identity(): it leaves the data as is.
ggplot2 geom layers and their default stats
geom | default stat |
---|---|
geom_abline() | - |
geom_hline() | - |
geom_vline() | - |
geom_bar() | stat_count() |
geom_col() | - |
geom_bin2d() | stat_bin_2d() |
geom_blank() | - |
geom_boxplot() | stat_boxplot() |
geom_contour() | stat_contour() |
geom_count() | stat_sum() |
geom_density() | stat_density() |
geom_density_2d() | stat_density_2d() |
geom_dotplot() | - |
geom_errorbarh() | - |
geom_hex() | stat_bin_hex() |
geom_freqpoly() | stat_bin() |
geom_histogram() | stat_bin() |
geom_crossbar() | - |
geom_errorbar() | - |
geom_linerange() | - |
geom_pointrange() | - |
geom_map() | - |
geom_point() | - |
geom_path() | - |
geom_line() | - |
geom_step() | - |
geom_polygon() | - |
geom_qq_line() | stat_qq_line() |
geom_qq() | stat_qq() |
geom_quantile() | stat_quantile() |
geom_ribbon() | - |
geom_area() | - |
geom_rug() | - |
geom_smooth() | stat_smooth() |
geom_spoke() | - |
geom_label() | - |
geom_text() | - |
geom_raster() | - |
geom_rect() | - |
geom_tile() | - |
geom_violin() | stat_ydensity() |
geom_sf() | stat_sf() |
ggplot2 stat layers and their default geoms
stat | default geom |
---|---|
stat_ecdf() | geom_step() |
stat_ellipse() | geom_path() |
stat_function() | geom_path() |
stat_identity() | geom_point() |
stat_summary_2d() | geom_tile() |
stat_summary_hex() | geom_hex() |
stat_summary_bin() | geom_pointrange() |
stat_summary() | geom_pointrange() |
stat_unique() | geom_point() |
stat_count() | geom_bar() |
stat_bin_2d() | geom_tile() |
stat_boxplot() | geom_boxplot() |
stat_contour() | geom_contour() |
stat_sum() | geom_point() |
stat_density() | geom_area() |
stat_density_2d() | geom_density_2d() |
stat_bin_hex() | geom_hex() |
stat_bin() | geom_bar() |
stat_qq_line() | geom_path() |
stat_qq() | geom_point() |
stat_quantile() | geom_quantile() |
stat_smooth() | geom_smooth() |
stat_ydensity() | geom_violin() |
stat_sf() | geom_rect() |
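About geom_smooth: three fitting methods are compared below.
glm fits generalized linear models; it can also be used for ordinary linear regression.
lm fits linear models; it cannot fit generalized linear models.
loess fits a local polynomial regression, which gives a smooth curve.
> p1<-ggplot(mpg, aes(displ, hwy, colour = class)) +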
+ geom_point() +
+ geom_smooth( method = glm,se=FALSE)
> p2<-ggplot(mpg, aes(displ, hwy, colour = class)) +
+ geom_point() +
+ geom_smooth( method = lm,se=FALSE)
> p3<-ggplot(mpg, aes(displ, hwy, colour = class)) +
+ geom_point() +
+ geom_smooth( method = loess,se=FALSE)
> library(gridExtra)
> grid.arrange(p1,p2,p3,ncol=2,nrow=2)
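The glm and lm panels are expected to look alike because glm() with its default gaussian family fits the same model as lm(). A quick check in base R, using the built-in iris data rather than mpg so that no package is needed (the formula is chosen only for illustration):

```r
# Same formula fitted with both functions on the built-in iris data
fit_lm <- lm(Sepal.Length ~ Sepal.Width, data = iris)
fit_glm <- glm(Sepal.Length ~ Sepal.Width, data = iris)  # family = gaussian by default

print(coef(fit_lm))
print(coef(fit_glm))

# The coefficients agree up to numerical tolerance
same_fit <- isTRUE(all.equal(coef(fit_lm), coef(fit_glm)))
print(same_fit)
```

A difference between the two methods only appears once a non-gaussian family is passed to glm().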
About group = 1
> p1=ggplot(data = diamonds) +
+ geom_bar(mapping = aes(x = cut, y = ..prop..))
> p2=ggplot(data = diamonds) +
+ geom_bar(mapping = aes(x = cut, fill = color, y = ..prop..))
> p3=ggplot(data = diamonds) +
+ geom_bar(mapping = aes(x = cut, fill = color, y = ..prop..,group=1))
> grid.arrange(p1,p2,p3,ncol=2,nrow=2)
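The y axis is ..prop.., the share of each category in the total. group = 1 puts all categories into one group so that the share of each category is computed against the whole, which is why group = 1 is needed.
Otherwise each category is its own group by default. For counts this gives an ordinary bar chart, but for proportions every category is 100% of its own group, so all bars reach the same full height. This is the plot drawn by the first snippet (p1).
If a fill mapping such as fill = color is also present, each level of color has height 1 within each bar, and the 7 colors stack to a total height of 7. This is the plot drawn by the second snippet (p2).
Author: 咕噜咕噜转的ATP合酶
Link: http://www.reibang.com/p/f36c3f8cfb24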
1.8 Position adjustments
> ggplot(data=iris)+
+ geom_bar(mapping = aes(x=Sepal.Width,y=Sepal.Length,fill=Species),stat="identity")
> p1=ggplot(data=iris)+
+ geom_bar(mapping = aes(x=Sepal.Width,fill=Species),alpha=3/5,position = "identity")
> p2=ggplot(data=iris)+
+ geom_bar(mapping = aes(x=Sepal.Width,fill=Species),alpha=3/5)
> p3=ggplot(data=iris)+
+ geom_bar(mapping = aes(x=Sepal.Width,color=Species),fill=NA,position = "identity")
> p4=ggplot(data=iris)+
+ geom_bar(mapping = aes(x=Sepal.Width,fill=Species),position = "fill")
> grid.arrange(p1,p2,p3,p4,ncol=2,nrow=2)
> p5=ggplot(data = iris)+
+ geom_bar(mapping=aes(x=Sepal.Width,fill=Species),position="dodge")
> grid.arrange(p1,p2,p3,p4,p5,ncol=2,nrow=3)
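What the position adjustments do to the bars can also be inspected numerically. The sketch below assumes ggplot2 is installed and reads the drawn bar extents (ymin, ymax) with layer_data(); the object names are illustrative.

```r
library(ggplot2)

base <- ggplot(data = iris, mapping = aes(x = Sepal.Width, fill = Species))

d_stack <- layer_data(base + geom_bar())                        # default: "stack"
d_fill <- layer_data(base + geom_bar(position = "fill"))
d_identity <- layer_data(base + geom_bar(position = "identity"))

# "stack": bars of later groups start where the previous one ends
stacked_offsets <- any(d_stack$ymin > 0)

# "fill": every stack is rescaled to a total height of 1
top_fill <- tapply(d_fill$ymax, d_fill$x, max)

# "identity": every bar starts at 0, so bars at the same x overlap
bottom_identity <- unique(d_identity$ymin)
```

This is why p1 and p3 above use transparency or an empty fill: with position = "identity" the overlapping bars would otherwise hide each other.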
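About overplotting:
The measurements are rounded, so several observations share the same coordinates and some of the overlapping points are not visible.
> p6=ggplot(data=iris)+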
+ geom_point(mapping = aes(x=Sepal.Width,y=Sepal.Length),position="jitter")
> p7=ggplot(data=iris)+
+ geom_point(mapping = aes(x=Sepal.Width,y=Sepal.Length))
> grid.arrange(p6,p7,ncol=2,nrow=2)
ggplot(data=iris,mapping = aes(x=Sepal.Width,y=Sepal.Length))+
+ geom_jitter()
produces the same result, since geom_jitter() is shorthand for geom_point(position = "jitter").
Fine-tuning jitter
> p8=ggplot(data=mpg,mapping=aes(x=cty,y=hwy))+
+ geom_jitter(aes(color=class))
> p <- ggplot(mpg, aes(cyl, hwy))
> p9 <- p+geom_jitter(aes(color=class))
> grid.arrange(p8,p9,ncol=2,nrow=2)
> p10=ggplot(data=mpg,mapping=aes(x=cyl,y=hwy))+
+ geom_jitter(aes(color=class))
> grid.arrange(p8,p9,p10,ncol=2,nrow=2)
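The amount of jitter itself can be tuned with the width and height arguments of geom_jitter(). A minimal sketch, assuming ggplot2 is installed; the values 0.2 and 0 and the seed are arbitrary choices for illustration.

```r
library(ggplot2)

set.seed(1)  # jitter is random; a seed makes the result reproducible

# width and height give the maximum shift in each direction (data units);
# height = 0 leaves the y values untouched
p_small <- ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_jitter(width = 0.2, height = 0)

d <- layer_data(p_small)
# cty and hwy are integer-valued, so rounding recovers the original values
max_shift_x <- max(abs(d$x - round(d$x)))
max_shift_y <- max(abs(d$y - round(d$y)))
cat("largest x shift:", max_shift_x, " largest y shift:", max_shift_y, "\n")
```

Smaller values keep the points closer to their true positions at the cost of more overlap.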
Compare and contrast geom_jitter() with geom_count().
The geom geom_jitter() adds random variation to the locations of the points in the graph. In other words, it "jitters" the locations of points slightly. This method reduces overplotting since two points with the same location are unlikely to have the same random variation.
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
geom_jitter()
However, the reduction in overlapping comes at the cost of slightly changing the x and y values of the points.
The geom geom_count() sizes the points relative to the number of observations. Combinations of (x, y) values with more observations will be larger than those with fewer observations.
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
geom_count()
The geom_count() geom does not change the x and y coordinates of the points. However, if the points are close together and counts are large, the size of some points can itself create overplotting. In the following example, a third variable mapped to color is added to the plot. In this case, geom_count() is less readable than geom_jitter().
ggplot(data = mpg, mapping = aes(x = cty, y = hwy, color = class)) +
geom_jitter()
ggplot(data = mpg, mapping = aes(x = cty, y = hwy, color = class)) +
geom_count()
As that example shows, there is unfortunately no universal solution to overplotting. The costs and benefits of different approaches will depend on the structure of the data and the goal of the data scientist.
1.9 Coordinate systems
coord_flip -- swaps the x and y axes
coord_quickmap -- sets a suitable aspect ratio for maps
coord_polar -- polar coordinates
usa<-map_data("usa")
nz<-map_data("nz")
ggplot(usa, aes(long, lat, group = group)) +
geom_polygon(fill = "white", color = "black") +
coord_quickmap()
ggplot(iris, aes(x = factor(1), fill = Species)) +
geom_bar()
ggplot(iris, aes(x = factor(1), fill = Species)) +
geom_bar(width = 1) +
coord_polar(theta = "y")
The argument theta = "y" maps y to the angle of each section. If coord_polar() is specified without theta = "y", then the resulting plot is called a bulls-eye chart.
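coord_flip() is listed above without an example. One possible illustration, assuming ggplot2 is installed; the boxplot of hwy by class is chosen here only because horizontal boxes make long category labels easier to read.

```r
library(ggplot2)

# A vertical boxplot in the default Cartesian coordinate system
p <- ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
  geom_boxplot()

# The same plot with the x and y axes swapped
p_flipped <- p + coord_flip()

# The coordinate system is stored on the plot object
print(class(p$coordinates)[1])
print(class(p_flipped$coordinates)[1])
```

Only the coordinate system changes; the mappings in aes() stay the same.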