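1. The much-loved tidyverse (an umbrella package)

tidyr, dplyr, stringr, ggplot2 (member packages; each can also be installed on its own)

tidyr

gather-spread # gather columns into key-value rows / spread them back out
separate-unite # split one column into several / combine several into one
Handling missing values: drop_na, replace_na, fill

dplyr: the core package, built for working with data frames

Basics

mutate(): add new columns
select(): pick columns
filter(): pick rows
arrange(): sort the data frame by a column
summarise(): summarise

Going further
count()
The pipe %>% (Ctrl+Shift+M): the output of one step becomes the input of the next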

tidyr

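In the tidyverse world, ggplot included, row names are ignored.
Data cleaning: in tidy data, each variable occupies one column and each observation occupies one row.
Three important functions in the R package tidyr: usage and examples of gather, spread, and separate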
https://blog.csdn.net/six66667/article/details/84888644

I. Data cleaning

rm(list = ls())
options(stringsAsFactors = F)
if(!require(tidyr))install.packages("tidyr")
### Raw data

test <- data.frame(geneid = paste0("gene",1:4),
                 sample1 = c(1,4,7,10),
                 sample2 = c(2,5,0.8,11),
                 sample3 = c(0.3,6,9,12))
test

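Wide to long

test_gather <- gather(data = test, # source data; must be a data frame
                    key = sample_nm,
                    value = exp, # names of the new key-value pair: the original column names become the key, their values become the value
                    - geneid) # which columns to gather; all of them by default. - geneid means every column except geneid, so only the other three are gathered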
head(test_gather)

 geneid sample_nm exp
1  gene1   sample1   1
2  gene2   sample1   4
3  gene3   sample1   7
4  gene4   sample1  10
5  gene1   sample2   2
6  gene2   sample2   5

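Long to wide

# spread widens a table: it splits one column of keys (with its paired values) into several columns.
#spread(data, key, value, fill = NA, convert = FALSE, drop =TRUE, sep = NULL)
# key is the name of the column to be split up; value is the column of the original table whose values fill the new columns
test_re <- spread(data = test_gather,
                key = sample_nm,# name of the column whose values become the new column names
                value = exp) # name of the column whose values fill the new columns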
head(test_re)

 geneid sample1 sample2 sample3
1  gene1       1     2.0     0.3
2  gene2       4     5.0     6.0
3  gene3       7     0.8     9.0
4  gene4      10    11.0    12.0

II. Splitting and combining

Raw data

> ### Raw data
> test <- data.frame(x = c( "a,b", "a,d", "b,c"));test
    x
1 a,b
2 a,d
3 b,c

Splitting

# arguments: data frame, column to split, new column names, separator
> test_seprate <- separate(test,x, c("X", "Y"),sep = ",");test_seprate
  X Y
1 a b
2 a d
3 b c

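Combining

# data: the data frame
# col: name of the new combined column
# ...: the columns to combine
# sep: separator placed between the combined values; defaults to an underscore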
> test_re <- unite(test_seprate,"x",X,Y,sep = ",");test_re
    x
1 a,b
2 a,d
3 b,c

III. Handling NA

Raw data

> X<-data.frame(X1 = LETTERS[1:5],X2 = 1:5)
> X[2,2] <- NA
> X[4,1] <- NA
> X
    X1 X2
1    A  1
2    B NA
3    C  3
4 <NA>  4
5    E  5

1. Drop rows containing NA; the check can be limited to specific columns

> drop_na(X)# drop every row that contains an NA; same result as na.omit
  X1 X2
1  A  1
2  C  3
3  E  5
> drop_na(X,X1)# only look at column X1 and drop the rows where it is NA
  X1 X2
1  A  1
2  B NA
3  C  3
4  E  5
> drop_na(X,X[2,])# error: rows cannot be given here; drop_na only selects columns
Error: Must subset columns with a valid subscript vector.
x Subscript has the wrong type `data.frame<
  X1: character
  X2: integer
>`.
i It must be numeric or character.
Run `rlang::last_error()` to see where the error occurred.

2. Replacing NA

> replace_na(X$X2,0)# replace the NA in column X2 with 0
[1] 1 0 3 4 5
> replace_na(X,0)# error: for a data frame, replace must be a named list, not a single value
Error in replace_na.data.frame(X, 0) : is_list(replace) is not TRUE

3. Fill NA with the value from the previous row

> fill(X,X2)# fill the missing values in X2 with the value from the row above
    X1 X2
1    A  1
2    B  1
3    C  3
4 <NA>  4
5    E  5
> fill(X)# no effect: with no columns selected, nothing is filled
    X1 X2
1    A  1
2    B NA
3    C  3
4 <NA>  4
5    E  5
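To fill every column, the columns can be selected explicitly or with a tidyselect helper. This sketch assumes the same X as above and uses everything() plus the `.direction` argument of fill(); the names X_down and X_up are illustrative.

```r
library(tidyr)

X <- data.frame(X1 = LETTERS[1:5], X2 = 1:5, stringsAsFactors = FALSE)
X[2, 2] <- NA
X[4, 1] <- NA

# fill all columns downwards (the default direction)
X_down <- fill(X, everything())

# fill the named columns upwards, using the value from the row below
X_up <- fill(X, X1, X2, .direction = "up")
```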

For the full set of operations, see the cheat sheets: https://www.rstudio.com/resources/cheatsheets/

dplyr

rm(list = ls())

## Prepare the package and the data
if(!require(dplyr))install.packages("dplyr")
library(dplyr)
test <- iris[c(1:2,51:52,101:102),]
rownames(test) =NULL

Five basic functions

1. mutate(): add a new column

> mutate(test, new = Sepal.Length * Sepal.Width)
  Sepal.Length Sepal.Width Petal.Length Petal.Width    Species   new
1          5.1         3.5          1.4         0.2     setosa 17.85
2          4.9         3.0          1.4         0.2     setosa 14.70
3          7.0         3.2          4.7         1.4 versicolor 22.40
4          6.4         3.2          4.5         1.5 versicolor 20.48
5          6.3         3.3          6.0         2.5  virginica 20.79
6          5.8         2.7          5.1         1.9  virginica 15.66

2. select(): pick columns

(1) By column position

> select(test,1)
  Sepal.Length
1          5.1
2          4.9
3          7.0
4          6.4
5          6.3
6          5.8
> select(test,c(1,5))
  Sepal.Length    Species
1          5.1     setosa
2          4.9     setosa
3          7.0 versicolor
4          6.4 versicolor
5          6.3  virginica
6          5.8  virginica

(2) By column name

> select(test,Sepal.Length)
  Sepal.Length
1          5.1
2          4.9
3          7.0
4          6.4
5          6.3
6          5.8
> select(test, Petal.Length, Petal.Width)
  Petal.Length Petal.Width
1          1.4         0.2
2          1.4         0.2
3          4.7         1.4
4          4.5         1.5
5          6.0         2.5
6          5.1         1.9
> vars <- c("Petal.Length", "Petal.Width")
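> select(test, one_of(vars))# one_of('x','y','z') selects the columns whose names are in the character vector; current tidyselect recommends all_of() or any_of() instead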
  Petal.Length Petal.Width
1          1.4         0.2
2          1.4         0.2
3          4.7         1.4
4          4.5         1.5
5          6.0         2.5
6          5.1         1.9

(3) A set of useful helpers from tidyselect

select(test, starts_with("Petal"))

##   Petal.Length Petal.Width
## 1          1.4         0.2
## 2          1.4         0.2
## 3          4.7         1.4
## 4          4.5         1.5
## 5          6.0         2.5
## 6          5.1         1.9

select(test, ends_with("Width"))

##   Sepal.Width Petal.Width
## 1         3.5         0.2
## 2         3.0         0.2
## 3         3.2         1.4
## 4         3.2         1.5
## 5         3.3         2.5
## 6         2.7         1.9

select(test, contains("etal"))

##   Petal.Length Petal.Width
## 1          1.4         0.2
## 2          1.4         0.2
## 3          4.7         1.4
## 4          4.5         1.5
## 5          6.0         2.5
## 6          5.1         1.9

select(test, matches(".t."))

##   Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1          5.1         3.5          1.4         0.2
## 2          4.9         3.0          1.4         0.2
## 3          7.0         3.2          4.7         1.4
## 4          6.4         3.2          4.5         1.5
## 5          6.3         3.3          6.0         2.5
## 6          5.8         2.7          5.1         1.9

select(test, everything())

##   Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
## 1          5.1         3.5          1.4         0.2     setosa
## 2          4.9         3.0          1.4         0.2     setosa
## 3          7.0         3.2          4.7         1.4 versicolor
## 4          6.4         3.2          4.5         1.5 versicolor
## 5          6.3         3.3          6.0         2.5  virginica
## 6          5.8         2.7          5.1         1.9  virginica

select(test, last_col())

##      Species
## 1     setosa
## 2     setosa
## 3 versicolor
## 4 versicolor
## 5  virginica
## 6  virginica

select(test, last_col(offset = 1)) # offset = 1 counts back one column from the last, giving column ncol - 1

##   Petal.Width
## 1         0.2
## 2         0.2
## 3         1.4
## 4         1.5
## 5         2.5
## 6         1.9

(4) Use everything() to reorder columns

select(test,Species,everything())

##      Species Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1     setosa          5.1         3.5          1.4         0.2
## 2     setosa          4.9         3.0          1.4         0.2
## 3 versicolor          7.0         3.2          4.7         1.4
## 4 versicolor          6.4         3.2          4.5         1.5
## 5  virginica          6.3         3.3          6.0         2.5
## 6  virginica          5.8         2.7          5.1         1.9

3. filter(): pick rows

> filter(test, Species == "setosa")
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
> filter(test, Species == "setosa"&Sepal.Length > 5 )
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
> filter(test, Species %in% c("setosa","versicolor"))
  Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
1          5.1         3.5          1.4         0.2     setosa
2          4.9         3.0          1.4         0.2     setosa
3          7.0         3.2          4.7         1.4 versicolor
4          6.4         3.2          4.5         1.5 versicolor

4. arrange(): sort the whole table by one or more columns

> arrange(test, Sepal.Length)# ascending by default
  Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
1          4.9         3.0          1.4         0.2     setosa
2          5.1         3.5          1.4         0.2     setosa
3          5.8         2.7          5.1         1.9  virginica
4          6.3         3.3          6.0         2.5  virginica
5          6.4         3.2          4.5         1.5 versicolor
6          7.0         3.2          4.7         1.4 versicolor
> arrange(test, desc(Sepal.Length))# use desc() for descending order
  Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
1          7.0         3.2          4.7         1.4 versicolor
2          6.4         3.2          4.5         1.5 versicolor
3          6.3         3.3          6.0         2.5  virginica
4          5.8         2.7          5.1         1.9  virginica
5          5.1         3.5          1.4         0.2     setosa
6          4.9         3.0          1.4         0.2     setosa
> arrange(test,  desc(Sepal.Width),Sepal.Length)
  Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
1          5.1         3.5          1.4         0.2     setosa
2          6.3         3.3          6.0         2.5  virginica
3          6.4         3.2          4.5         1.5 versicolor
4          7.0         3.2          4.7         1.4 versicolor
5          4.9         3.0          1.4         0.2     setosa
6          5.8         2.7          5.1         1.9  virginica

Base R can do the same thing as arrange with order

> library(dplyr)
> test = iris[c(1,2,51,52,101,102),]
> test
    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
1            5.1         3.5          1.4         0.2     setosa
2            4.9         3.0          1.4         0.2     setosa
51           7.0         3.2          4.7         1.4 versicolor
52           6.4         3.2          4.5         1.5 versicolor
101          6.3         3.3          6.0         2.5  virginica
102          5.8         2.7          5.1         1.9  virginica
> rownames(test) = NULL
> arrange(test,Sepal.Length,Sepal.Width)
  Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
1          4.9         3.0          1.4         0.2     setosa
2          5.1         3.5          1.4         0.2     setosa
3          5.8         2.7          5.1         1.9  virginica
4          6.3         3.3          6.0         2.5  virginica
5          6.4         3.2          4.5         1.5 versicolor
6          7.0         3.2          4.7         1.4 versicolor
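> o = order(test$Sepal.Length)# returns the indices
> test$Sepal.Length[o]# equivalent to sort
[1] 4.9 5.1 5.8 6.3 6.4 7.0

x[order(x)]
sort(x)# x[order(x)] is the same as sort(x)
# order(x) does more than sort x itself: the indices can also reorder other columns or the row names of the same data frame, or the whole data frame

> test[o,]# the indices of a column are row numbers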
  Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
2          4.9         3.0          1.4         0.2     setosa
1          5.1         3.5          1.4         0.2     setosa
6          5.8         2.7          5.1         1.9  virginica
5          6.3         3.3          6.0         2.5  virginica
4          6.4         3.2          4.5         1.5 versicolor
3          7.0         3.2          4.7         1.4 versicolor
# same ordering as arrange (only the row names differ)
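order() also accepts several keys, which covers the multi-column arrange() call shown earlier. The sketch below uses base R only; negating a numeric column is one common way to get descending order, and the names o2 and by_order are illustrative.

```r
test <- iris[c(1:2, 51:52, 101:102), ]
rownames(test) <- NULL

# descending Sepal.Width, ties broken by ascending Sepal.Length
o2 <- order(-test$Sepal.Width, test$Sepal.Length)
by_order <- test[o2, ]
by_order
```

The rows come out in the same order as arrange(test, desc(Sepal.Width), Sepal.Length), with the original row numbers kept as row names.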

5. summarise(): summaries, usually combined with grouping

Summarises the data; it is most useful together with group_by

> summarise(test, mean(Sepal.Length), sd(Sepal.Length))# mean and standard deviation of Sepal.Length
  mean(Sepal.Length) sd(Sepal.Length)
1           5.916667        0.8084965
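> # inside dplyr functions, column names are written without quotes or $
> # group by Species first, then compute the mean and standard deviation of Sepal.Length per group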
> group_by(test, Species)
# A tibble: 6 x 5
# Groups:   Species [3]
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species   
         <dbl>       <dbl>        <dbl>       <dbl> <fct>     
1          5.1         3.5          1.4         0.2 setosa    
2          4.9         3            1.4         0.2 setosa    
3          7           3.2          4.7         1.4 versicolor
4          6.4         3.2          4.5         1.5 versicolor
5          6.3         3.3          6           2.5 virginica 
6          5.8         2.7          5.1         1.9 virginica 
> tmp = summarise(group_by(test, Species),mean(Sepal.Length), sd(Sepal.Length))
`summarise()` ungrouping output (override with `.groups` argument)
> tmp
# A tibble: 3 x 3
  Species    `mean(Sepal.Length)` `sd(Sepal.Length)`
  <fct>                     <dbl>              <dbl>
1 setosa                     5                 0.141
2 versicolor                 6.7               0.424
3 virginica                  6.05              0.354
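The summary columns above are named after the expressions that produced them, which is awkward to refer to later. A minimal sketch of the same grouped summary with explicit names, assuming dplyr 1.0.0 or later for the `.groups` argument; mean_len and sd_len are names chosen here for illustration.

```r
library(dplyr)

test <- iris[c(1:2, 51:52, 101:102), ]
rownames(test) <- NULL

tmp <- test %>%
  group_by(Species) %>%
  summarise(mean_len = mean(Sepal.Length),  # named summary columns
            sd_len = sd(Sepal.Length),
            .groups = "drop")               # return an ungrouped result without the message
tmp
```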

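Two handy skills

1: The pipe %>% (Cmd/Ctrl + Shift + M): the output of one step becomes the input of the next

The pipe is not limited to dplyr
The two approaches below give the same result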

library(dplyr)
x1 = filter(iris,Sepal.Width>3)
x2 = select(x1,c("Sepal.Length","Sepal.Width" ))
x3 = arrange(x2,Sepal.Length)

x=iris %>% 
  filter(Sepal.Width>3) %>% 
  select(c("Sepal.Length","Sepal.Width" ))%>%
  arrange(Sepal.Length)
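# saves two intermediate assignments
# the final result is stored in x

2: count(): count how many times each unique value of a column occurs

Compared with table: count takes a data frame and returns a data frame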

> count(test,Species)
     Species n
1     setosa 2
2 versicolor 2
3  virginica 2
> # the output is a data frame, which is tidy
> table(test$Species)

    setosa versicolor  virginica 
         2          2          2 
> class(table(test$Species))
[1] "table"
# by comparison, table returns an object of class table

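Relational data: joining two tables. Note: avoid introducing factors

dplyr provides a family of functions ending in _join for taking different combinations of two tables. The most important are the full join and the inner join (the intersection)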

options(stringsAsFactors = F)

test1 <- data.frame(name = c('jimmy','nicker','doodle'), 
                    blood_type = c("A","B","O"))
test1

##     name blood_type
## 1  jimmy          A
## 2 nicker          B
## 3 doodle          O

test2 <- data.frame(name = c('doodle','jimmy','nicker','tony'),
                    group = c("group1","group1","group2","group2"),
                    vision = c(4.2,4.3,4.9,4.5))
test2

##     name  group vision
## 1 doodle group1    4.2
## 2  jimmy group1    4.3
## 3 nicker group2    4.9
## 4   tony group2    4.5

test3 <- data.frame(NAME = c('doodle','jimmy','lucy','nicker'),
                    weight = c(140,145,110,138))
test3

##     NAME weight
## 1 doodle    140
## 2  jimmy    145
## 3   lucy    110
## 4 nicker    138

merge(test1,test2,by="name")

##     name blood_type  group vision
## 1 doodle          O group1    4.2
## 2  jimmy          A group1    4.3
## 3 nicker          B group2    4.9

merge(test1,test3,by.x = "name",by.y = "NAME")

##     name blood_type weight
## 1 doodle          O    140
## 2  jimmy          A    145
## 3 nicker          B    138

1. Inner join, inner_join: keeps the intersection

inner_join(test1, test2, by = "name")

##     name blood_type  group vision
## 1  jimmy          A group1    4.3
## 2 nicker          B group2    4.9
## 3 doodle          O group1    4.2

inner_join(test1,test3,by = c("name"="NAME"))

##     name blood_type weight
## 1  jimmy          A    145
## 2 nicker          B    138
## 3 doodle          O    140

2. Left join, left_join

left_join(test1, test2, by = 'name')

##     name blood_type  group vision
## 1  jimmy          A group1    4.3
## 2 nicker          B group2    4.9
## 3 doodle          O group1    4.2

left_join(test2, test1, by = 'name')

##     name  group vision blood_type
## 1 doodle group1    4.2          O
## 2  jimmy group1    4.3          A
## 3 nicker group2    4.9          B
## 4   tony group2    4.5       <NA>

3. Full join, full_join

full_join(test1, test2, by = 'name')

##     name blood_type  group vision
## 1  jimmy          A group1    4.3
## 2 nicker          B group2    4.9
## 3 doodle          O group1    4.2
## 4   tony       <NA> group2    4.5

4. Semi join, semi_join: returns all rows of x that have a match in y

semi_join(x = test1, y = test2, by = 'name')

##     name blood_type
## 1  jimmy          A
## 2 nicker          B
## 3 doodle          O

5. Anti join, anti_join: returns all rows of x that have no match in y

anti_join(x = test2, y = test1, by = 'name')

##   name  group vision
## 1 tony group2    4.5

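6. Simple binding of data

bind_rows() and bind_cols() correspond to rbind() and cbind() in base R. Note that bind_rows() matches columns by name and fills columns missing from one table with NA, so the tables do not need the same number of columns; bind_cols() binds by position and requires the data frames to have the same number of rows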

test1 <- data.frame(x = c(1,2,3,4), y = c(10,20,30,40))
test1

##   x  y
## 1 1 10
## 2 2 20
## 3 3 30
## 4 4 40

test2 <- data.frame(x = c(5,6), y = c(50,60))
test2

##   x  y
## 1 5 50
## 2 6 60

test3 <- data.frame(z = c(100,200,300,400))
test3

##     z
## 1 100
## 2 200
## 3 300
## 4 400

bind_rows(test1, test2)

##   x  y
## 1 1 10
## 2 2 20
## 3 3 30
## 4 4 40
## 5 5 50
## 6 6 60

bind_cols(test1, test3)

##   x  y   z
## 1 1 10 100
## 2 2 20 200
## 3 3 30 300
## 4 4 40 400
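bind_rows() matches columns by name, so it also accepts tables whose columns differ. A small sketch with two made-up data frames, a and b, that are not part of the original notes:

```r
library(dplyr)

a <- data.frame(x = c(1, 2), y = c(10, 20))
b <- data.frame(x = 3, z = 300)

# columns are matched by name; cells with no source value become NA
ab <- bind_rows(a, b)
ab
```

Base rbind() would stop with an error here, because the column names of a and b do not match.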

Exercise 6-1

1. Gather the first four columns of the iris data frame, then restore them

tmp <- iris
tmp_gather <- tmp %>% 
  gather(key = bioinformation, value = number, -Species)
head(tmp_gather)

##   Species bioinformation number
## 1  setosa   Sepal.Length    5.1
## 2  setosa   Sepal.Length    4.9
## 3  setosa   Sepal.Length    4.7
## 4  setosa   Sepal.Length    4.6
## 5  setosa   Sepal.Length    5.0
## 6  setosa   Sepal.Length    5.4

tmp_re <- tmp_gather %>%
  group_by(bioinformation) %>% 
  mutate(id=1:n()) %>%
  spread(bioinformation,number)
head(tmp_re)

## # A tibble: 6 x 6
##   Species    id Petal.Length Petal.Width Sepal.Length Sepal.Width
##   <fct>   <int>        <dbl>       <dbl>        <dbl>       <dbl>
## 1 setosa      1          1.4         0.2          5.1         3.5
## 2 setosa      2          1.4         0.2          4.9         3  
## 3 setosa      3          1.3         0.2          4.7         3.2
## 4 setosa      4          1.5         0.2          4.6         3.1
## 5 setosa      5          1.4         0.2          5           3.6
## 6 setosa      6          1.7         0.4          5.4         3.9

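Instructor Xiaojie's own solution

2. Split the second column into two columns (using the decimal point as the separator), then combine them again

#### in a regular expression a dot means any character, so it is escaped here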
x=separate(test,
           Sepal.Width,
           into = c('a','b'),
           sep = "\\.")

## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 1 rows [2].

x$b <- replace_na(x$b,0);x

##   Sepal.Length a b Petal.Length Petal.Width    Species
## 1          5.1 3 5          1.4         0.2     setosa
## 2          4.9 3 0          1.4         0.2     setosa
## 3          7.0 3 2          4.7         1.4 versicolor
## 4          6.4 3 2          4.5         1.5 versicolor
## 5          6.3 3 3          6.0         2.5  virginica
## 6          5.8 2 7          5.1         1.9  virginica

x_re=unite(x,
           "Sepal.Width",
           a,b,sep = ".");x_re

##   Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
## 1          5.1         3.5          1.4         0.2     setosa
## 2          4.9         3.0          1.4         0.2     setosa
## 3          7.0         3.2          4.7         1.4 versicolor
## 4          6.4         3.2          4.5         1.5 versicolor
## 5          6.3         3.3          6.0         2.5  virginica
## 6          5.8         2.7          5.1         1.9  virginica

x_re$Sepal.Width <- as.numeric(x_re$Sepal.Width)
str(x_re)

## 'data.frame':    6 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 7 6.4 6.3 5.8
##  $ Sepal.Width : num  3.5 3 3.2 3.2 3.3 2.7
##  $ Petal.Length: num  1.4 1.4 4.7 4.5 6 5.1
##  $ Petal.Width : num  0.2 0.2 1.4 1.5 2.5 1.9
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 2 2 3 3

stringr

rm(list = ls())
if(!require(stringr))install.packages('stringr')
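library(stringr)# keep this line to be safe: if the package was already installed, it is now certainly loaded; if it was not, the line above only installed it and did not load it

x <- "The birch canoe slid on the smooth planks."# this is a character string

1. String length

> ###1. String length
> length(x)
[1] 1
> length(x) # length of the vector: how many elements it contains
[1] 1
> str_length(x)# number of characters in each element of the vector, spaces included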
[1] 42

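2. Splitting and combining strings

> str_split(x," ")# split x on spaces
[[1]]
[1] "The"     "birch"   "canoe"   "slid"    "on"      "the"     "smooth"  "planks."

> x2 = str_split(x," ")[[1]]# subset the list to get the character vector
> # x can be a single string or a vector of several strings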
> y=sentences[1:3]
> str_split(y," ")
[[1]]
[1] "The"     "birch"   "canoe"   "slid"    "on"      "the"     "smooth"  "planks."

[[2]]
[1] "Glue"        "the"         "sheet"       "to"          "the"         "dark"        "blue"        "background."

[[3]]
[1] "It's"  "easy"  "to"    "tell"  "the"   "depth" "of"    "a"     "well."

> # a list with 3 elements
> y2=str_split(y," ",simplify = T)# simplify = T turns the list into a matrix
> View(y2)
> y2
     [,1]   [,2]    [,3]    [,4]   [,5]  [,6]    [,7]     [,8]          [,9]   
[1,] "The"  "birch" "canoe" "slid" "on"  "the"   "smooth" "planks."     ""     
[2,] "Glue" "the"   "sheet" "to"   "the" "dark"  "blue"   "background." ""     
[3,] "It's" "easy"  "to"    "tell" "the" "depth" "of"     "a"           "well."
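# as a matrix, the shorter rows are padded to the length of the longest; the empty cells hold empty strings
# two different kinds of joining
> str_c(x2,collapse = " ")# collapse: the separator used to join the elements of one vector into a single string
[1] "The birch canoe slid on the smooth planks."# the original sentence is restored
> str_c(x2,1234,sep = "+")# sep works as in paste: it joins across vectors element by element, so 8 elements in, 8 elements out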
[1] "The+1234"     "birch+1234"   "canoe+1234"   "slid+1234"    "on+1234"      "the+1234"     "smooth+1234" 
[8] "planks.+1234"

3. Extracting part of a string

> str_sub(x,5,9)# extract characters 5 through 9
[1] "birch"

4. Changing case

> str_to_upper(x2)# all upper case
[1] "THE"     "BIRCH"   "CANOE"   "SLID"    "ON"      "THE"     "SMOOTH"  "PLANKS."
> str_to_lower(x2)# all lower case
[1] "the"     "birch"   "canoe"   "slid"    "on"      "the"     "smooth"  "planks."
> str_to_title(x2)# first letter capitalized
[1] "The"     "Birch"   "Canoe"   "Slid"    "On"      "The"     "Smooth"  "Planks."

5. Sorting strings

> str_sort(x2)
[1] "birch"   "canoe"   "on"      "planks." "slid"    "smooth"  "the"     "The"    
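# more capable than sort: it takes a locale, so text in other languages such as Greek can be sorted by its own rules

6. Pattern detection: returns a logical vector of the same length (important)

> str_detect(x2,"h")# check whether each element of the vector contains the letter h
[1]  TRUE  TRUE FALSE FALSE FALSE  TRUE  TRUE FALSE
str_starts(x2,"T")# does the element start with ...
str_ends(x2,"e")# does the element end with ...
> ### combined with sum and mean, this gives the number and the proportion of matches
> sum(str_detect(x2,"h"))# how many TRUE
[1] 4
> mean(str_detect(x2,"h"))# proportion of TRUE, which is also the mean of the 0/1 values
[1] 0.5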
[1] 0.5
> as.numeric(str_detect(x2,"h"))
[1] 1 1 0 0 0 1 1 0

7. Extracting the matching strings

> str_subset(x2,"h")
[1] "The"    "birch"  "the"    "smooth"
> # same as x2[str_detect(x2,"h")]

8. Counting matches

> str_count(x," ")# x is a single element: count the spaces in x
[1] 7
> str_count(x2,"o")# x2 is a vector: count the o's in each element of x2
[1] 0 0 1 0 1 0 2 0

9. Replacing within strings

> str_replace(x2,"o","A")# replaces only the first match in each element
[1] "The"     "birch"   "canAe"   "slid"    "An"      "the"     "smAoth"  "planks."
> str_replace_all(x2,"o","A")# replaces every match
[1] "The"     "birch"   "canAe"   "slid"    "An"      "the"     "smAAth"  "planks."

Even more powerful with regular expressions

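Regular expressions: syntax
A regular expression describes a pattern for matching strings. It can be used to check whether a string contains a given substring, to replace the matched substring, or to extract the substring that meets some condition.

For example:

runoo+b matches runoob, runooob, runoooooob, and so on; + means the preceding character must appear at least once (one or more times).

runoo*b matches runob, runoob, runoooooob, and so on; * means the preceding character may be absent or may appear once or many times (zero or more times).

colou?r matches color or colour; ? means the preceding character may appear at most once (zero times or once).

Regular expressions are built the same way as arithmetic expressions: metacharacters and operators combine small expressions into larger ones. The components can be single characters, character sets, character ranges, alternatives between characters, or any combination of these.

A regular expression is a text pattern made of ordinary characters (such as the letters a to z) and special characters (called "metacharacters"). The pattern describes one or more strings to match when searching text; it acts as a template that is matched against the string being searched.
Special characters
Special characters are characters with a special meaning, such as the * in runoo*b above, which is a quantifier for the preceding character rather than a literal asterisk. To find a literal * in a string, escape it by putting a \ in front of it: runo\*ob matches runo*ob.

Many metacharacters need this special treatment when you want to match them literally: escape the character by placing a backslash \ in front of it. Inside an R string the backslash itself must be doubled, so the pattern is written "\\*".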
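The quantifiers and the escaping rule described above can be checked directly with str_detect(). This is a small sketch; the vector words is made up for illustration, and the anchors ^ and $ are added so that each pattern has to match the whole string.

```r
library(stringr)

words <- c("runob", "runoob", "runoooob", "color", "colour", "runo*ob")

plus_match <- str_detect(words, "^runoo+b$")  # + : one or more of the preceding "o"
star_match <- str_detect(words, "^runoo*b$")  # * : zero or more of the preceding "o"
opt_match  <- str_detect(words, "^colou?r$")  # ? : the "u" is optional
lit_star   <- str_detect(words, "\\*")        # escaped: a literal asterisk
lit_fixed  <- str_detect(words, fixed("*"))   # fixed() treats the pattern as plain text

words[plus_match]
words[star_match]
words[opt_match]
words[lit_star]
```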

Exercise 6-2

#Bioinformatics is a new subject of genetic data collection,analysis and dissemination to the research community.
#1. Assign the sentence above to tmp as one long string
tmp <- "Bioinformatics is a new subject of genetic data collection,analysis and dissemination to the research community."
#2. Split it into a vector of words and assign it to tmp2 (mind the punctuation)
> tmp2 <- str_split(tmp," ")[[1]];tmp2# wrong: the punctuation was not handled
 [1] "Bioinformatics"      "is"                  "a"                   "new"                 "subject"            
 [6] "of"                  "genetic"             "data"                "collection,analysis" "and"                
[11] "dissemination"       "to"                  "the"                 "research"            "community."

Correct answer

> tmp2 = tmp %>% 
+   str_replace(","," ") %>%# replace the comma with a space
+   str_remove("[.]") %>% 
+   str_split(" ")
> tmp2
[[1]]
 [1] "Bioinformatics" "is"             "a"              "new"            "subject"        "of"            
 [7] "genetic"        "data"           "collection"     "analysis"       "and"            "dissemination" 
[13] "to"             "the"            "research"       "community"     
> tmp2 = tmp2[[1]]
> tmp2
 [1] "Bioinformatics" "is"             "a"              "new"            "subject"        "of"            
 [7] "genetic"        "data"           "collection"     "analysis"       "and"            "dissemination" 
[13] "to"             "the"            "research"       "community"

> str_remove(tmp,".")# the B is removed: in a regular expression . means any character
[1] "ioinformatics is a new subject of genetic data collection,analysis and dissemination to the research community."
str_remove(tmp,"\\.")# use \\ or square brackets to match a literal dot

> #3. Use a function to return the number of words in the sentence.
> length(tmp2)
[1] 16
> #4. Use a function to return the number of letters in each word.
> str_length(tmp2)
 [1] 14  2  1  3  7  2  7  4 10  8  3 13  2  3  8  9
> #5. Count how many words in tmp2 contain the letter "e"
> table(str_detect(tmp2,"e"))

FALSE  TRUE 
    9     7 
> # or use sum

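I. Conditional statements

if, ifelse, and for are the key ones; master them
The if statement: if ... then ..., otherwise ...
if(a single logical value){ some code } else { some code }
Note that if takes exactly one logical value; it is not vectorized. When the value is TRUE the code runs; when it is FALSE the code is skipped and the else branch, if present, runs instead.
If the body is a single statement, the braces are optional.

1.if(){ }

(1) if without else: when the condition is FALSE, nothing happens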

rm(list = ls())
> i = -1
> if (i<0) print('up')
[1] "up"
> if (i>0) print('up')
# make sure you understand the code below
if(!require(tidyr)) install.packages('tidyr')

(2) With else

> i =1
> if (i>0){
+   cat('+')# prints the text as is, unlike print("+")
+ } else {
+   print("-")
+ }
+

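ifelse: very important

ifelse(x,yes,no)
Three arguments
x: a logical value. Vectors are supported; ifelse is vectorized.
yes: the value returned where x is TRUE
no: the value returned where x is FALSE

x is a logical vector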

> ifelse(i>0,"+","-")
[1] "+"
> x=rnorm(10)
> y=ifelse(x>0,"+","-")
> y
 [1] "-" "+" "+" "-" "+" "-" "+" "+" "+" "+"

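x can be replaced by a function call that returns logical values
For example, group the elements of a character vector by whether they contain a keyword, and label each group as you like

ifelse(str_detect(x,"h"),"+","-") # here x must be a character vector

(3) Multiple conditions

i = 0
if (i>0){ # if takes a single logical value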
  print('+')
} else if (i==0) {
  print('0')
} else if (i< 0){
  print('-')
}
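# you can chain as many else if branches as you like

# ifelse has only 3 arguments, but calls can be nested
ifelse(i>0,
       "+",
       ifelse(i<0,
              "-",
              "0"))
# further nesting can go in place of "+" or "0"

# deep nesting is hard to read
# case_when() from dplyr is easier to use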
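A minimal sketch of the case_when() alternative mentioned above, assuming dplyr is installed. It is vectorized like ifelse, checks the conditions in order, and uses TRUE as the catch-all branch; the vector i and the name sign_label are illustrative.

```r
library(dplyr)

i <- c(-2, 0, 3)

# each formula is condition ~ value; the first condition that holds wins
sign_label <- case_when(i > 0 ~ "+",
                        i < 0 ~ "-",
                        TRUE  ~ "0")
sign_label
```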

2.switch()

> cd = 3
> foo <- switch(EXPR = cd, 
+               #EXPR = "aa", 
+               aa=c(3.4,1),
+               bb=matrix(1:4,2,2),
+               cc=matrix(c(T,T,F,T,F,F),3,2),
+               dd="string here",
+               ee=matrix(c("red","green","blue","yellow")))
> foo
      [,1]  [,2]
[1,]  TRUE  TRUE
[2,]  TRUE FALSE
[3,] FALSE FALSE
> foo <- switch(#EXPR = cd, 
+               EXPR = "aa", 
+               aa=c(3.4,1),
+               bb=matrix(1:4,2,2),
+               cc=matrix(c(T,T,F,T,F,F),3,2),
+               dd="string here",
+               ee=matrix(c("red","green","blue","yellow")))
> foo
[1] 3.4 1.0

The switch statement in R
https://www.w3cschool.cn/r/r_switch_statement.html

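Managing long scripts

1. Split the work into several scripts; each script saves an Rdata file at the end, and the next script starts by clearing the workspace and loading that file.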
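The hand-off between two scripts could look like the sketch below. The file name step1.Rdata and the object exp_data are hypothetical, and tempdir() is used only so that the example does not write into your project folder.

```r
# end of script 1: save the objects the next script needs
exp_data <- data.frame(geneid = paste0("gene", 1:4),
                       value = c(1, 4, 7, 10))
save(exp_data, file = file.path(tempdir(), "step1.Rdata"))

# start of script 2: clear, then load
rm(exp_data)  # a real script would typically start with rm(list = ls())
load(file.path(tempdir(), "step1.Rdata"))
head(exp_data)
```

load() restores the objects under the names they had when they were saved, so exp_data is available again without any assignment.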

2. if(F){...} skips the code inside the braces, and if(T){...} runs it. Any code wrapped in {} can be folded in the editor.
Alternatively, comment the code out with #

II. Loops

1. for loops

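for ( i in x ){code}
Performs the same operation for every element i of x. x is usually a vector. i takes the values of the elements of x, not their positions.
The loop runs as many times as x has elements
It stops automatically after the last element
We also look at next and break along the way

Looping over x itself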

> x <- c(5,6,0,3)
> s=0
> for (i in x){
+   s=s+i
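+   #if(i == 0) next # skip the rest of this iteration and move on to the next
+   #if (i == 0) break # stop the loop entirely; no further iterations run
+   print(c(which(x==i),i,1/i,s))
+ }
# which(x==i) returns the index of the element of x equal to i, i.e. the current iteration number (as long as x has no repeated values)
[1] 1.0 5.0 0.2 5.0
[1]  2.0000000  6.0000000  0.1666667 11.0000000
[1]   3   0 Inf  11   # Inf is positive infinity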
[1]  4.0000000  3.0000000  0.3333333 14.0000000

Looping over the indices of x

x <- c(5,6,0,3)
s = 0
for (i in 1:length(x)){
  s=s+x[[i]]# inside loops, prefer double brackets for subsetting; single brackets sometimes cause errors
  #if(i == 3) next # skip the rest of this iteration and move on to the next
  #if (i == 3) break # stop the loop entirely; no further iterations run
  print(c(i,x[[i]],1/i,s))
}
[1] 1 5 1 5
[1]  2.0  6.0  0.5 11.0
[1]  3.0000000  0.0000000  0.3333333 11.0000000
[1]  4.00  3.00  0.25 14.00
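One caveat with 1:length(x): when x is empty it evaluates to c(1, 0), so the loop body still runs twice. seq_along(x) is the usual safer way to produce the indices. A small sketch, with empty and n_runs introduced only for illustration:

```r
x <- c(5, 6, 0, 3)
s <- 0
for (i in seq_along(x)) {   # same indices as 1:length(x) when x is not empty
  s <- s + x[[i]]
}
s

empty <- numeric(0)
n_runs <- 0
for (i in seq_along(empty)) {  # zero iterations for an empty vector
  n_runs <- n_runs + 1
}
n_runs
1:length(empty)                # c(1, 0): a loop over this would run twice
```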

How do we store the results?

> s = 0
> result = list()# create result first as an empty list, then add elements one at a time
> for(i in 1:length(x)){
+   s=s+x[[i]]
+   result[[i]] = c(i,x[[i]],1/i,s)
+ }
> result
[[1]]
[1] 1 5 1 5

[[2]]
[1]  2.0  6.0  0.5 11.0

[[3]]
[1]  3.0000000  0.0000000  0.3333333 11.0000000

[[4]]
[1]  4.00  3.00  0.25 14.00
> # the elements are regular, so they can be simplified into a matrix
> do.call(cbind,result)# bind the elements of the list together as columns
     [,1] [,2]       [,3]  [,4]
[1,]    1  2.0  3.0000000  4.00
[2,]    5  6.0  0.0000000  3.00
[3,]    1  0.5  0.3333333  0.25
[4,]    5 11.0 11.0000000 14.00
# A list is awkward to export as text, so we need a way to flatten a regular list into a matrix or data frame quickly. do.call does this.
# Put simply, do.call calls a function whose arguments are supplied in a list, one list element per argument.
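If one row per iteration is more convenient than one column, the same list can be bound with rbind and converted to a data frame. This is a sketch under the same setup as the loop above; the element names index, value, inverse, and running_sum are chosen here for illustration.

```r
x <- c(5, 6, 0, 3)
s <- 0
result <- list()
for (i in seq_along(x)) {
  s <- s + x[[i]]
  # naming the elements gives the final table its column names
  result[[i]] <- c(index = i, value = x[[i]], inverse = 1 / i, running_sum = s)
}

# do.call(rbind, result) is the same as rbind(result[[1]], ..., result[[4]])
res_df <- as.data.frame(do.call(rbind, result))
res_df
```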

2. while loops: as long as ... holds

Use while and repeat with care: make sure you know how the loop will end
There is no automatic stopping mechanism

i = 0

while (i < 5){
  print(c(i,i^2))
  i = i+1
}

3. The repeat statement

# note: a break is required
i=0L
s=0L
repeat{
 i = i + 1
 s = s + i
 print(c(i,s))
 if(i==50) break
}

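Key functions

sort
match
names
ifelse and str_detect
identical
arrange
merge and inner_join
unique and duplicated

Key topics

Subsetting vectors, data frames, and lists
Adding columns to a data frame
Reading files
Saving and loading Rdata
Saving plots
Installing and loading R packages
Formal arguments, actual arguments, and default arguments

Listing, creating, and deleting files in R

dir() # files in the working directory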
file.create()
file.exists(…)
file.remove()
file.rename(from, to)
file.append(file1, file2)
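A short walk through these functions, run inside a temporary folder so nothing in the working directory is touched. The folder name demo_dir and the file names a.txt and b.txt are hypothetical; dir.create() and unlink() are base R functions added here to create and remove the folder itself.

```r
work_dir <- file.path(tempdir(), "demo_dir")   # hypothetical folder name
dir.create(work_dir)                           # create the folder

f1 <- file.path(work_dir, "a.txt")
f2 <- file.path(work_dir, "b.txt")

file.create(f1)                  # create an empty file
existed <- file.exists(f1)       # TRUE once the file is there
file.rename(from = f1, to = f2)  # rename a.txt to b.txt
listed <- dir(work_dir)          # list the files in the folder
file.remove(f2)                  # delete the file

unlink(work_dir, recursive = TRUE)  # delete the folder and anything left in it
```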

</article>

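Author: Ruizheng
Link: http://www.reibang.com/p/59369bbb40ab
Source: Jianshu
Copyright belongs to the author. For commercial reuse, contact the author for permission; for non-commercial reuse, credit the source.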

DAY7 Getting Started with Bioinformatics: Advanced
