The tidyverse is a set of packages that work in harmony because they share common data representations and API design. The tidyverse package is designed to make it easy to install and load core packages from the tidyverse in a single command.
In short, it is a complete suite of packages for data processing.
The data processing workflow:
- Data import
- Data tidying
- Data exploration (visualization, statistical analysis)
If you’d like to learn how to use the tidyverse effectively, the best place to start is R for Data Science.
Installation
# Install from CRAN
install.packages("tidyverse")
# Or the development version from GitHub
# install.packages("devtools")
devtools::install_github("tidyverse/tidyverse")
Usage
library(tidyverse) will load the core tidyverse packages:
- ggplot2, for data visualisation.
- dplyr, for data manipulation.
- tidyr, for data tidying.
- readr, for data import.
- purrr, for functional programming.
- tibble, for tibbles, a modern re-imagining of data frames.
- stringr, for strings.
- forcats, for factors.
library(tidyverse)
# Load the data
library(datasets)
# install.packages("gapminder")  # run once if gapminder is not yet installed
library(gapminder)
# Filtering data with dplyr
# filter() extracts the subset of rows that satisfy its conditions.
iris %>%
  filter(Species == "virginica") # keep only the rows that match
iris %>%
  filter(Species == "virginica", Sepal.Length > 6) # separate multiple conditions with commas
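Conditions separated by commas are combined with AND. As a minimal sketch of two other ways to combine conditions (the object names `either` and `two_species` are made up for illustration), `|` expresses OR and `%in%` matches any of several values:

```r
library(dplyr)

# OR: keep rows where at least one of the two conditions holds
either <- iris %>%
  filter(Species == "setosa" | Sepal.Length > 7)

# %in%: keep rows whose Species is one of the listed values
two_species <- iris %>%
  filter(Species %in% c("setosa", "versicolor"))

nrow(two_species) # 100 (50 rows per species)
```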
# Sorting
# arrange() sorts the observations, in ascending order by default.
iris %>%
  arrange(Sepal.Length)
iris %>%
  arrange(desc(Sepal.Length)) # descending order
# Adding variables
# mutate() updates an existing column of a data frame or adds a new one.
iris %>%
  mutate(Sepal.Length = Sepal.Length * 10) # convert the column from cm to mm
iris %>%
  mutate(SLMm = Sepal.Length * 10) # create a new column
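Note that mutate() returns a modified copy and leaves the original data frame untouched, so the result has to be assigned if it is needed later. A small sketch of this, where `iris_mm` is just an illustrative name:

```r
library(dplyr)

# Assign the result; iris itself is not modified
iris_mm <- iris %>%
  mutate(SLMm = Sepal.Length * 10)

"SLMm" %in% names(iris)    # FALSE
"SLMm" %in% names(iris_mm) # TRUE
```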
# Chaining the functions into a single pipeline:
iris %>%
  filter(Species == "virginica") %>% # the label is lowercase; "Virginica" matches no rows
  mutate(SLMm = Sepal.Length * 10) %>%
  arrange(desc(SLMm))
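String comparisons inside filter() are case-sensitive, so a label that differs only in capitalisation silently returns zero rows. One way to guard against this, sketched here, is to check the exact factor levels before filtering:

```r
library(dplyr)

# Inspect the exact spelling of the Species labels
levels(iris$Species)
# "setosa" "versicolor" "virginica"

# A capitalised label matches nothing; the lowercase one matches 50 rows
n_wrong <- iris %>% filter(Species == "Virginica") %>% nrow() # 0
n_right <- iris %>% filter(Species == "virginica") %>% nrow() # 50
```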
# Summarizing
# summarize() collapses the many values of a variable into a single data point.
iris %>%
  summarize(medianSL = median(Sepal.Length))
##   medianSL
## 1      5.8
iris %>%
  filter(Species == "virginica") %>%
  summarize(medianSL = median(Sepal.Length))
# Computing several summary values at once
iris %>%
  filter(Species == "virginica") %>%
  summarize(medianSL = median(Sepal.Length),
            maxSL = max(Sepal.Length))
# group_by() lets us summarize the data within specified groups rather than over the whole data frame.
iris %>%
  group_by(Species) %>%
  summarize(medianSL = median(Sepal.Length),
            maxSL = max(Sepal.Length))
iris %>%
  filter(Sepal.Length > 6) %>%
  group_by(Species) %>%
  summarize(medianPL = median(Petal.Length),
            maxPL = max(Petal.Length))
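When summarizing by group it is often useful to know how many rows each group contributes, especially after a filter() has removed some of them. A minimal sketch using dplyr's n() helper (the names `group_sizes`, `n_obs` and `meanSL` are chosen here for illustration):

```r
library(dplyr)

# One row per species, with the group size next to the summary value
group_sizes <- iris %>%
  group_by(Species) %>%
  summarize(n_obs = n(),
            meanSL = mean(Sepal.Length))

group_sizes
```

For a plain count without other summaries, `iris %>% count(Species)` gives the same group sizes.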
# ggplot2
# Scatter plot
# A scatter plot helps us understand the relationship between two variables; geom_point() draws one:
iris_small <- iris %>%
  filter(Sepal.Length > 5)
ggplot(iris_small, aes(x = Petal.Length,
                       y = Petal.Width)) +
  geom_point()
# Color
ggplot(iris_small, aes(x = Petal.Length,
                       y = Petal.Width,
                       color = Species)) +
  geom_point()
# Size
ggplot(iris_small, aes(x = Petal.Length,
                       y = Petal.Width,
                       color = Species,
                       size = Sepal.Length)) +
  geom_point()
# Faceting
ggplot(iris_small, aes(x = Petal.Length,
                       y = Petal.Width)) +
  geom_point() +
  facet_wrap(~Species)
# Line plot
by_year <- gapminder %>%
  group_by(year) %>%
  summarize(medianGdpPerCap = median(gdpPercap))
ggplot(by_year, aes(x = year,
                    y = medianGdpPerCap)) +
  geom_line() +
  expand_limits(y = 0) # make the y axis start at zero
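The grouping and color ideas shown earlier combine naturally with line plots. As a sketch (not part of the original walkthrough), grouping by both year and the `continent` column of gapminder and mapping it to color draws one line per continent; `by_year_continent` and `p_lines` are illustrative names, and `.groups = "drop"` assumes dplyr 1.0 or later:

```r
library(dplyr)
library(ggplot2)
library(gapminder)

# Median GDP per capita for every year/continent combination
by_year_continent <- gapminder %>%
  group_by(year, continent) %>%
  summarize(medianGdpPerCap = median(gdpPercap), .groups = "drop")

# One line per continent, distinguished by color
p_lines <- ggplot(by_year_continent, aes(x = year,
                                         y = medianGdpPerCap,
                                         color = continent)) +
  geom_line() +
  expand_limits(y = 0)
p_lines
```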
# Bar chart
by_species <- iris %>%
  filter(Sepal.Length > 6) %>%
  group_by(Species) %>%
  summarize(medianPL = median(Petal.Length))
ggplot(by_species, aes(x = Species, y = medianPL)) +
  geom_col()
# Histogram
ggplot(iris_small, aes(x = Petal.Length)) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
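The message printed by geom_histogram() is a reminder that the default of 30 bins is arbitrary. Setting `binwidth` explicitly silences it and makes the binning deliberate. In the sketch below the value 0.25 is only an assumed starting point, not a recommendation from the original text, and `p_hist` is an illustrative name:

```r
library(dplyr)
library(ggplot2)

iris_small <- iris %>%
  filter(Sepal.Length > 5)

# Each bar now covers 0.25 cm of petal length (assumed value; tune as needed)
p_hist <- ggplot(iris_small, aes(x = Petal.Length)) +
  geom_histogram(binwidth = 0.25)
p_hist
```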
# Box plot
ggplot(iris_small, aes(x = Species, y = Sepal.Length)) +
  geom_boxplot()
References:
http://www.reibang.com/p/f3c21a5ad10a
https://tidyverse.tidyverse.org/
https://zhuanlan.zhihu.com/p/88947457