1 Preparation

Load packages
library(tidyverse)
library(stringr)

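1.1 Introduction to stringr

stringr is designed as a consistent, simple, easy-to-use set of string tools. All of its functions and arguments are defined consistently; for example, NA values and zero-length vectors are handled the same way everywhere.
String processing is not the headline feature of R, but it is indispensable: data cleaning, visualization and many other tasks depend on it. The string functions in base R have accumulated over time and are inconsistent in many places, with irregular names and non-standard argument definitions, so they are hard to pick up at a glance. In many other languages string handling is very convenient, and R lagged behind in this respect. stringr was created to solve this problem by providing a friendly, easy-to-use string interface.

stringr project page: https://cran.r-project.org/web/packages/stringr/index.html

1.2 Categories of stringr functions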

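1.2.1 String concatenation functions
str_c: concatenate strings.
str_join: concatenate strings, same as str_c (an alias from early stringr versions that has since been removed; use str_c).
str_trim: remove spaces and tabs (\t) from the ends of a string
str_pad: pad a string to a given width
str_dup: duplicate a string
str_wrap: wrap a string into formatted paragraphs
str_sub: extract a substring
str_sub<-: replace a substring by assignment, using the same positions as str_sub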

1.2.2 String measurement functions
str_count: count matches in a string
str_length: string length
str_sort: sort strings by value
str_order: return the ordering index, same rules as str_sort

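1.2.3 String matching functions
str_split: split a string
str_split_fixed: split a string into a fixed number of pieces, otherwise like str_split
str_subset: return the strings that match
word: extract words from text
str_detect: test whether a string contains a match
str_match: extract matched groups from a string.
str_match_all: extract matched groups for every match, otherwise like str_match
str_replace: replace the first match
str_replace_all: replace every match, otherwise like str_replace
str_replace_na: turn NA into the string "NA"
str_locate: find the position of the first match.
str_locate_all: find the positions of every match, otherwise like str_locate
str_extract: extract the first match from a string
str_extract_all: extract every match, otherwise like str_extract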

1.2.4 String transformation functions
str_conv: convert character encoding
str_to_upper: convert a string to upper case
str_to_lower: convert a string to lower case, same rules as str_to_upper
str_to_title: convert a string to title case, same rules as str_to_upper

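1.2.5 Pattern-control functions, used only to build arguments for other functions, not on their own
boundary: match boundaries (character, word, line break, sentence)
coll: match using standard collation rules.
fixed: match the pattern as a literal string, with regex special characters taken literally
regex: define a regular expression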

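1.3 Important functions in the `stringr` package

Function	Description	Base R equivalent
Functions that use regular expressions
`str_extract()`	Extract the first match of the pattern	`regmatches()`
`str_extract_all()`	Extract all matches of the pattern	`regmatches()`
`str_locate()`	Return the position of the first match	`regexpr()`
`str_locate_all()`	Return the positions of all matches	`gregexpr()`
`str_replace()`	Replace the first match	`sub()`
`str_replace_all()`	Replace all matches	`gsub()`
`str_split()`	Split a string by a pattern	`strsplit()`
`str_split_fixed()`	Split a string by a pattern into a fixed number of pieces	-
`str_detect()`	Test whether a string contains the pattern	`grepl()`
`str_count()`	Count the occurrences of the pattern	-
Other important functions
`str_sub()`	Extract characters at given positions	`substr()`
`str_dup()`	Duplicate a string a given number of times	`strrep()`
`str_length()`	Return the number of characters	`nchar()`
`str_pad()`	Pad a string	-
`str_trim()`	Remove padding, such as leading and trailing spaces	`trimws()`
`str_c()`	Concatenate strings	`paste(),paste0()`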
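
The correspondence in the table can be checked side by side. The following is a minimal sketch, assuming only that stringr is installed; the vector `fruits_demo` is made up for illustration and is not part of the original text.

```r
library(stringr)

# Hypothetical sample data, used only for this comparison
fruits_demo <- c("apple", "banana", "pear")

# Detection: str_detect() vs grepl()
detect_stringr <- str_detect(fruits_demo, "an")
detect_base    <- grepl("an", fruits_demo)

# Replace every match: str_replace_all() vs gsub()
replace_stringr <- str_replace_all(fruits_demo, "a", "-")
replace_base    <- gsub("a", "-", fruits_demo)

# Number of characters: str_length() vs nchar()
length_stringr <- str_length(fruits_demo)
length_base    <- nchar(fruits_demo)

# Each pair should agree on this input
identical(detect_stringr, detect_base)
identical(replace_stringr, replace_base)
identical(length_stringr, length_base)
```

Note the argument order: stringr functions take the string first and the pattern second, while the base functions above take the pattern first.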

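1.4 Special characters

In a regular expression, ., ^, $, *, +, ?, [, ], (, ), {, } and \ must be escaped with \ to be matched literally; / has no special meaning in R regular expressions and does not need escaping.
Use ?'"' or ?"'" to open the help page with the complete list of special characters in strings. To match a newline, use the string "\n" as the regular expression.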

?"'"
?'"'

# \n     newline
# \r     carriage return
# \t     tab
# \b     backspace
# \a     alert (bell)
# \f     form feed
# \v     vertical tab
# \\     backslash \
# \'     ASCII apostrophe '
# \"     ASCII quotation mark "
# \`     ASCII grave accent (backtick) `
# \nnn   character with given octal code (1, 2 or 3 digits)
# \xnn   character with given hex code (1 or 2 hex digits)
# \unnnn     Unicode character with given code (1--4 hex digits)
# \Unnnnnnnn     Unicode character with given code (1--8 hex digits)

2 String basics

# Strings can be created with either single or double quotes. Unlike in some other languages, the two behave identically in R. We recommend always using "
string1 <- "This is a string"
string2 <- 'To put a "quote" inside a string, use single quotes'


# If you forget the closing quote you will see the continuation prompt `+`. When that happens, press `Esc` and type the command again

# To include a literal single or double quote in a string, "escape" it with `\`:

(double_quote <- "\"" )
# [1] "\""
('"')
# [1] "\""

(single_quote <- '\'')
# [1] "'"
("'")
# [1] "'"

# To include a literal backslash in a string, write two backslashes: \\
(x <- c("\"", "\\"))
# [1] "\"" "\\"
writeLines(x)
# "
# \

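# The printed representation of a string is not the same as the string itself, because **the printed representation shows the escapes**. To see the raw contents of a string, use `writeLines()`
# A Unicode escape such as "\u00b5" writes a non-ASCII character in a way that works on every platform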
x <- "\u00b5"
x
# [1] "μ"

2.1 String length: `str_length()`

str_length(c("a", "R for data science", NA))
# [1]  1 18 NA

2.3 Combining strings: `str_c()`

# To combine two or more strings, use str_c()
str_c("x", "y")
# [1] "xy"

str_c("x", "y", "z")
# [1] "xyz"

# Use the sep argument to control how the pieces are separated:
str_c("x", "y", sep = "_")
# [1] "x_y"

# As with most R functions, missing values are contagious. To have them printed as "NA", use str_replace_na():
x <- c("abc", NA)
x
# [1] "abc" NA

str_c("|-", x, "-|")
# [1] "|-abc-|" NA 

str_c("|-", str_replace_na(x), "-|")
# [1] "|-abc-|" "|-NA-|" 


# str_c() is vectorised: it recycles shorter vectors to the length of the longest one
str_c("prefix-", c("a", "b", "c"), "-suffix")
# [1] "prefix-a-suffix" "prefix-b-suffix" "prefix-c-suffix"
# To collapse a character vector into a single string, use the collapse argument:
str_c(c("x", "y", "z"), collapse = ", ")
# [1] "x, y, z"

2.4 Subsetting strings: `str_sub()`

str_sub() extracts part of a string. Besides the string itself, it takes start and end arguments that give the position of the substring (both start and end are inclusive):

x <- c("Apple", "Banana", "Pear")
str_sub(x, 1, 3)
# [1] "App" "Ban" "Pea"

# Negative numbers count backwards from the end
str_sub(x, -3, -1)
# [1] "ple" "ana" "ear"

# str_sub() does not fail when the string is too short; it returns as much as it can:
str_sub("a", 1, 5)
# [1] "a"

# The assignment form of str_sub() modifies strings:
str_sub(x, 1, 1) <- str_to_lower(str_sub(x, 1, 1))
x
#> [1] "apple" "banana" "pear"

2.5 Locales

# Case conversion: str_to_lower(), str_to_upper(), str_to_title()
# Changing case is more complicated than it looks because different languages have different rules
# Turkish has two i's, with and without a dot, and they are capitalised differently:
str_to_upper(c("i", "ı"))
#> [1] "I" "I"
str_to_upper(c("i", "ı"), locale = "tr")
#> [1] "İ" "I"
x <- c("apple", "eggplant", "banana")
# English
str_sort(x, locale = "en") 
#> [1] "apple" "banana" "eggplant"
# Hawaiian
str_sort(x, locale = "haw") 
#> [1] "apple" "eggplant" "banana"

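2.6 Exercises

2.6.1 In code that doesn't use stringr, you'll often see paste() and paste0(). What is the difference between the two functions? Which stringr functions correspond to them? How do the functions differ in their handling of NA?

Usage
paste (..., sep = " ", collapse = NULL)
paste0(..., collapse = NULL)
paste0() corresponds to str_c(); paste() corresponds to str_c() with sep = " "
Examples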

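# sep: separator placed between the arguments being joined
# collapse: separator used to join the elements of the resulting vector into one string

# paste() joins with a space " " by default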
paste("abc","def","ghi")
# [1] "abc def ghi"
paste("abc","def","ghi",sep = ".")
# [1] "abc.def.ghi"
# collapse has no visible effect here: the result is already a single string
paste("abc","def","ghi",collapse = ".")
# [1] "abc def ghi"
paste(c("abc","def","ghi"),collapse = ".")
# [1] "abc.def.ghi"

# paste0() joins with no separator by default
paste0("abc","def","ghi")
# [1] "abcdefghi"
# paste0() has no sep argument: sep = "." is treated as one more string to join
paste0("abc","def","ghi",sep = ".")
# [1] "abcdefghi."
# collapse has no visible effect when the result has a single element
paste0("abc","def","ghi",collapse = ".")
# [1] "abcdefghi"
# collapse joins the elements of a vector
paste0(c("abc","def","ghi"),collapse = ".")
# [1] "abc.def.ghi"


# NA is contagious in str_c(), whereas paste()/paste0() convert it to the string "NA"
paste("abc","def","ghi",NA)
# [1] "abc def ghi NA"
paste0("abc","def","ghi",NA)
# [1] "abcdefghiNA"
str_c("abc","def","ghi",NA)
# [1] NA

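2.6.2 In your own words, describe the difference between the sep and collapse arguments to str_c().

# sep: separator placed between the arguments being joined
# collapse: separator used to join the elements of the result into a single string
# str_c() joins with no separator by default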
str_c("abc","def","ghi",sep = ".")
# [1] "abc.def.ghi"
str_c("abc","def","ghi",collapse = ".")
# [1] "abcdefghi"
str_c(c("abc","def","ghi"),sep = ".")
# [1] "abc" "def" "ghi"
str_c(c("abc","def","ghi"),collapse = ".")
# [1] "abc.def.ghi"

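2.6.3 Use str_length() and str_sub() to extract the middle character from a string. What will you do if the string has an even number of characters?

# floor: round down, the largest integer not greater than the number
# ceiling: round up, the smallest integer not less than the number
# trunc: keep the integer part
# round: round to a given number of decimal places
# signif: round to a given number of significant digits, common in scientific work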

x <- c("a", "abc", "abcd", "abcde", "abcdef")
# Length of each string
(len <- str_length(x))
(m <- ceiling(len/2))
(n <- floor(len/2))
# Use the modulo operator "%%" to tell odd lengths from even ones
ifelse (len%%2 !=0,str_sub(x, m, m),str_sub(x,n,n+1))
# if_else (len%%2 !=0,str_sub(x, m, m),str_sub(x,n,n+1))
# [1] "a"  "b"  "bc" "c"  "cd"

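2.6.4 What does str_wrap() do? When might you want to use it?

It reformats a string into paragraphs of a given width, which is useful when printing long text
str_wrap(string, width = 80, indent = 0, exdent = 0)

Argument	Description
string	character vector of strings to reformat
width	target line width in characters
indent	indentation of the first line of each paragraph
exdent	indentation of the following lines of each paragraph
Value	a character vector of reformatted strings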

thanks <- str_c(readLines(R.home("doc/THANKS")), collapse = "\n")
thanks <- word(thanks, 1, 3, fixed("\n\n"))
cat(str_wrap(thanks), "\n")
cat(str_wrap(thanks, width = 40), "\n")
cat(str_wrap(thanks, width = 60, indent = 2), "\n")
cat(str_wrap(thanks, width = 60, exdent = 2), "\n")

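2.6.5 What does str_trim() do? What is its opposite?

# str_trim() removes whitespace from the start and/or end of a string; its opposite, str_pad(), adds it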
str_trim("   ab cd  ")
# [1] "ab cd"
str_trim("   ab cd  ","both")
# [1] "ab cd"
str_trim("   ab cd  ","left")
# [1] "ab cd  "
str_trim("   ab cd  ","right")
# [1] "   ab cd"
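
The opposite operation asked about in this exercise can be illustrated with str_pad(). This is a small sketch rather than part of the original answer; the widths are arbitrary values chosen for the demonstration.

```r
library(stringr)

trimmed <- str_trim("   ab cd  ")   # "ab cd", 5 characters

# str_pad() adds a pad character (a space by default) up to a target width
padded_left <- str_pad(trimmed, width = 8, side = "left")   # 3 spaces added on the left
padded_both <- str_pad(trimmed, width = 9, side = "both")   # 2 spaces added on each side

# Any single character can be used as padding
padded_zero <- str_pad("7", width = 3, side = "left", pad = "0")
```

Padding does not recover the exact original string unless you know how much whitespace was removed from each side.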

2.6.6 Write a function that turns a character vector into a string, for example c("a", "b", "c") into the string "a, b, and c". Think carefully about what it should do if given a vector of length 0, 1, or 2.

str_commasep <- function(x, delim = ",") {
  n <- length(x)
  if (n == 0) {
    ""
  } else if (n == 1) {
    x
  } else if (n == 2) {
    # no comma before and when n == 2
    str_c(x[[1]], "and", x[[2]], sep = " ")
  } else {
    # commas after all n - 1 elements
    not_last <- str_c(x[seq_len(n - 1)], delim)
    # prepend "and" to the last element
    last <- str_c("and", x[[n]], sep = " ")
    # combine parts with spaces
    str_c(c(not_last, last), collapse = " ")
  }
}
str_commasep("")
#> [1] ""
str_commasep("a")
#> [1] "a"
str_commasep(c("a", "b"))
#> [1] "a and b"
str_commasep(c("a", "b", "c"))
#> [1] "a, b, and c"
str_commasep(c("a", "b", "c", "d"))
#> [1] "a, b, c, and d"

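3 Pattern matching with regular expressions

3.1 `str_view()` , `str_view_all()`

Use str_view() and str_view_all() to learn regular expressions. Both take a character vector and a regular expression and show how they match

str_view() shows the first match
str_view_all() shows all matches
From stringr 1.5.0 on, str_view() highlights every match and str_view_all() is deprecated; the two descriptions above apply to earlier versions.

3.1 Basic matches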

x <- c("apple", "banana", "pear")
str_view(x, "an")

# The wildcard "." matches any character (except a newline):
str_view(x, ".a.")

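Like strings, regular expressions use the backslash to strip a character of its special meaning. So to match a literal ., the regular expression you need is \.. But \ is also the escape character in strings, so the regular expression \. has to be written as the string "\\.":
# Find "a.c" in the vector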
str_view(c("abc", "a.c", "bef"), "a\\.c")

# "\" is the escape character in regular expressions, so matching a literal "\" requires the regular expression \\
(x <- "a\\b")
# [1] "a\\b"
writeLines(x)
#> a\b

# In a string each of those two backslashes must itself be escaped, so the pattern is typed with four: "\\\\"
str_view(x, "\\\\")
#> a\b

# The second " is escaped by the \, so the string is never closed and the console shows the continuation prompt `+`
a1 <- "\"

a2 <- "\\"
writeLines(a2)
# \

# With an odd number of backslashes the closing " is again escaped, so the continuation prompt `+` still appears
a3 <- "\\\"

# An even number of backslashes is valid: four in the source give a string of two
a4 <- "\\\\"
writeLines(a4)
# \\

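3.2 Exercises

3.2.1 Explain why each of these strings doesn't match a \: "\", "\\", "\\\".

With an odd number of backslashes ("\" and "\\\"), the last backslash escapes the closing double quote, so the string is never terminated and the console shows the continuation prompt "+".
"\\" is a valid string holding one backslash, but as a regular expression a lone \ is an escape with nothing after it, which is an error. To match a backslash, the regular expression must be \\, and each of those backslashes must be escaped again in the string, so you type four backslashes: "\\\\".

3.2.2 How would you match the sequence "'\ ?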

x <- "\"\'\\"
x
# [1] "\"'\\"
writeLines(x)
# "'\
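
The code above builds the target string but stops short of matching it. One way to finish the exercise is sketched below; the variable names `target` and `pattern` are introduced here for illustration.

```r
library(stringr)

# The three characters: double quote, single quote, backslash
target <- "\"'\\"

# Quotes are ordinary characters in a regex; only the backslash needs escaping.
# Regex: "'\\   which is typed as the R string "\"'\\\\"
pattern <- "\"'\\\\"

found    <- str_detect(c(target, "abc"), pattern)
captured <- str_extract("say \"'\\ here", pattern)
writeLines(captured)
```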

3.2.3 What patterns will the regular expression \..\..\.. match? How would you represent it as a string?

 y <- "\..\..\.."
# Error: '\.' is an unrecognized escape in character string starting ""\."
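
The error above arises because each backslash in the regular expression has to be doubled inside an R string. A short sketch of the working form, with test strings invented for the demonstration:

```r
library(stringr)

# Regex \..\..\.. : a literal dot followed by any character, three times over
pattern <- "\\..\\..\\.."
writeLines(pattern)   # shows the regular expression as the regex engine sees it

# Hypothetical test strings
result <- str_detect(c(".a.b.c", "abc", "a.b.c."), pattern)
result
```

The last test string fails because its third dot has no character after it.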

3.3 Anchors (`^`, `$`)

^ matches the start of the string.
$ matches the end of the string.

x <- c("apple", "banana", "pear")
str_view(x, "^a")
str_view(x, "a$")
str_view(x, "^apple$")

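3.4 Exercises

3.4.1 How would you match the literal string "$^$"?

# Assign a test string
y <- "???$^$**&&"
# The anchors "^" and "$" are special characters and must be escaped with "\"; the backslash is itself special in strings, so it must be escaped too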
str_view(y,"\\$\\^\\$")

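3.4.2 Given the corpus of common words in stringr::words, create regular expressions that find all words that:

a. Start with y.
b. End with x.
c. Are exactly three letters long. (Don't cheat by using str_length()!)
d. Have seven letters or more.
Since this list is long, you might want to use the match argument to str_view() to show only the matching words (match = TRUE) or only the non-matching words (match = FALSE).

# Words that start with y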
str_view(words,"^y",match = T)
table(str_count(words,"^y"))
#  0   1 
#  974   6

# Words that end with x
str_view(words,"x$",match = T)
table(str_count(words,"x$"))
#  0   1 
#  976   4 

# Words that are exactly three characters long
str_view(words,"^(...)$",match = T)
table(str_count(words,"^(...)$"))
# 0   1 
# 870 110 
table(str_length(words))
# 1   2   3   4   5   6   7   8   9  10  11 
# 1  18 110 263 200 169 119  57  30   9   4 

# Words with seven or more characters: seven consecutive dots "." inside the parentheses
str_view(words,"(.......)",match = T)

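3.5 Character classes and alternatives

# "." matches any character except a newline
# \d matches any digit
# \s matches any whitespace (e.g. space, tab, newline)
# [abc] matches a, b, or c
# [^abc] matches anything except a, b, or c
# "|" has low precedence, so abc|xyz matches abc or xyz
# Parentheses make the scope of "|" explicit: str_view(c("grey", "gray"), "gr(e|a)y")

To create a regular expression containing \d or \s, you need to escape the \ in the string, so you type "\\d" or "\\s"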
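
The classes listed above can be tried out with str_detect(). This is a minimal sketch; the `samples` vector is invented for the demonstration and does not come from the original text.

```r
library(stringr)

# Hypothetical test strings: one with digits and a space, one with a tab, three spellings
samples <- c("room 101", "tab\there", "grey", "gray", "groy")

has_digit <- str_detect(samples, "\\d")        # regex \d, typed as "\\d"
has_space <- str_detect(samples, "\\s")        # regex \s matches the space and the tab
is_grey   <- str_detect(samples, "gr(e|a)y")   # alternation limited by parentheses
no_vowel  <- str_detect(samples, "^[^aeiou]")  # first character is not a vowel
```

Inside square brackets a leading ^ negates the class, while outside them it anchors the match to the start of the string; the last pattern uses both meanings.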

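3.6 Exercises

3.6.1 Create regular expressions to find all words that:

a. Start with a vowel.
b. Contain only consonants (hint: think about matching "not"-vowels).
c. End with ed, but not with eed.
d. End with ing or ize.

# Words that start with a vowel
str_view(words,"^[aoeiu]",match = T)

# Words that contain only consonants: every character from start to end is a non-vowel
str_view(words,"^[^aoeiu]+$",match = T)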


# Words that end with ed, but not with eed
str_view(words,"([^e]ed)$",match = T)
str_view(words,"[^e]ed$",match = T)

# Words that end with ing or ize
str_view(words,"((ing)|(ize))$",match = T)

3.6.2 Empirically verify the rule "i before e except after c".

str_view(words,"((cei)|[^c]ie)",match = T)

3.6.3 Is q always followed by a u?

table(ifelse(str_detect(words,"qu"),"Yes","NO"))
# NO Yes 
# 970  10

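3.6.4 Write a regular expression that matches a word if it's probably written in British English, not American English.

3.6.5 Create a regular expression that will match telephone numbers as commonly written in your country.

# Test data
(x <- c("+86-18217047048","1223333333","10545384333"))
# Every special character must be escaped: matching "+" needs the regular expression \+, and since \ must itself be escaped in a string, you type "\\+"; likewise "\\d"
# str_view(x,".86-1[89]\\d\\d\\d\\d\\d\\d\\d\\d\\d")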
str_view(x,"\\+86-1[89]\\d\\d\\d\\d\\d\\d\\d\\d\\d")

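3.7 Repetition

Another powerful feature of regular expressions is control over how many times a pattern matches.

?: 0 or 1 time.
+: 1 or more times.
*: 0 or more times.
{n}: exactly n times.
{n,}: n or more times.
{,m}: at most m times.
{n,m}: between n and m times.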

x <- "MDCCCCCCCLXXXCCCCLXXXVIII"

str_view(x, "CC?")

str_view(x, "CC+")

str_view_all(x, "CC+")

str_view(x, "C[LX]+")

str_view_all(x, "C[LX]+")

str_view(x, "C{2}")
str_view(x, "C{2,}")
str_view(x, "C{2,3}")
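# By default these matches are greedy: they match the longest string possible. Adding "?" after the quantifier makes them lazy, matching the shortest string possible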
str_view(x, 'C{2,3}?')
str_view(x, 'C[LX]+?')
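
Because str_view() renders its result in a viewer, the difference between greedy and lazy matching is easier to inspect with str_extract(), which returns the matched text. A small sketch using the same string as above:

```r
library(stringr)

x <- "MDCCCCCCCLXXXCCCCLXXXVIII"

# Greedy: take as many repetitions as the quantifier allows
greedy_rep <- str_extract(x, "C{2,3}")
greedy_set <- str_extract(x, "C[LX]+")

# Lazy: a trailing ? makes the quantifier stop as early as possible
lazy_rep <- str_extract(x, "C{2,3}?")
lazy_set <- str_extract(x, "C[LX]+?")

c(greedy_rep, lazy_rep, greedy_set, lazy_set)
```

In every case the match starts at the same position; only its length changes.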

3.8 Exercises

3.8.1 Describe the equivalents of ?, + and * in {m,n} form.

# `?`: 0 or 1 time
str_view("aaaaabbaabb","a?")
str_view("aaaaabbaabb","a{0,1}")

# `+`: 1 or more times
str_view("aaaaaabbaabb","a+")
str_view("aaaaaabbaabb","a{1,}")

# `*`: 0 or more times
str_view("aaaaaabbaabb","a*")
str_view("aaaaaabbaabb","a{0,}")

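3.8.2 Describe in words what these regular expressions match (read carefully to see whether each is a regular expression or a string that defines a regular expression):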
a. ^.*$
b. "\{.+\}"
c. \d{4}-\d{2}-\d{2}
d. "\\{4}"
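
The source leaves this exercise unanswered. As a hedged illustration, three of the patterns can be tried on made-up inputs. The sketch assumes that item b is meant as the regular expression \{.+\}, written here with doubled backslashes so that it is a valid R string; item d is left out because its intended escaping is unclear in the text.

```r
library(stringr)

# a. ^.*$ matches any whole string that contains no newline
whole <- str_detect(c("a", "anything at all"), "^.*$")

# c. \d{4}-\d{2}-\d{2}: four digits, dash, two digits, dash, two digits (a date-like shape)
date_like <- str_extract("released on 2020-03-15.", "\\d{4}-\\d{2}-\\d{2}")

# b. (assumed form) \{.+\}: a pair of curly braces with at least one character inside
braces <- str_extract("f(x) = {a, b}", "\\{.+\\}")
```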

3.8.3 Create regular expressions to find all words that:

a. Start with three consonants.
b. Have three or more vowels in a row.
c. Have two or more vowel-consonant pairs in a row.

str_view(words,"^([^aoieu]{3})",match = T)
str_view(words,"[aoieu]{3,}",match = T)
str_view(words,"([aoieu][^aoieu]){2,}",match = T)

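3.9 Grouping and backreferences

Parentheses also define "groups" that you can refer to with backreferences such as \1, \2 and so on

# \1 refers to the text matched by the group (..); to write \1 in a string, type "\\1"
str_view(fruit, "(..)\\1", match = TRUE)
# Parentheses () create groups, and \\n refers back to the n-th group
# In (..)\\1(...)(...)\\3, \\1 repeats the text of the first group (..), and \\3 repeats the text of the third group (...)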
str_view("abcdcdxyzefgefgefghijk", "(..)\\1", match = TRUE)
str_view("abcdcdxyzefgefgefghijk", "(..)\\1(...)(...)\\3", match = TRUE)

3.10 Exercises

3.10.1 Describe, in words, what these expressions will match:

a. (.)\1\1
b. "(.)(.)\2\1"
c. (..)\1
d. "(.).\1.\1"
e. "(.)(.)(.).*\3\2\1"
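
The source gives no answer here. One way to explore the five patterns is to run them against strings built to match. The sketch below assumes each item is meant as a regular expression and writes the backreferences with doubled backslashes; the test strings are invented.

```r
library(stringr)

# a. (.)\1\1 : the same character three times in a row
a <- str_extract("baaad", "(.)\\1\\1")

# b. (.)(.)\2\1 : two characters followed by the same two in reverse order
b <- str_extract("an abba song", "(.)(.)\\2\\1")

# c. (..)\1 : any two characters, repeated
c_ <- str_extract("cucumber", "(..)\\1")

# d. (.).\1.\1 : a character that recurs at every other position, three times
d <- str_extract("abaca", "(.).\\1.\\1")

# e. (.)(.)(.).*\3\2\1 : three characters, anything, then the same three reversed
e <- str_extract("abc-and-cba", "(.)(.)(.).*\\3\\2\\1")

c(a, b, c_, d, e)
```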

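3.10.2 Construct regular expressions to match words that:

a. Start and end with the same character.
b. Contain a repeated pair of letters (e.g. church contains ch repeated twice).
c. Contain one letter repeated in at least three places (e.g. eleven contains three e's).

# The alternative (\\1?$) handles single-character words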
str_view(words,"^([A-Za-z])((.*(\\1$))|(\\1?$))", match = TRUE)
str_view(words,"([A-Za-z][A-Za-z])(.*)\\1", match = TRUE)
str_view(words,"([A-Za-z])(.*)\\1(.*)\\1", match = TRUE)

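4 Tools

With regular expressions, stringr functions let you:

Determine which strings match a pattern;
Find the positions of matches;
Extract the content of matches;
Replace matches with new values;
Split a string based on a match.

4.1 Detect matches

str_detect()
To determine whether a character vector matches a pattern, use str_detect(). It returns a logical vector the same length as the input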

x <- c("apple", "banana", "pear")
str_detect(x, "e")
#> [1] TRUE FALSE TRUE

# In a numeric context FALSE becomes 0 and TRUE becomes 1, which makes sum() and mean() useful for summarising matches across a large vector
# How many common words start with t?
sum(str_detect(words, "^t"))
#> [1] 65
# What proportion of common words end with a vowel?
mean(str_detect(words, "[aeiou]$"))
#> [1] 0.277

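# When the logical conditions are complex, it is often easier to combine several str_detect() calls with logical operators
# Find all words containing at least one vowel, then negate
no_vowels_1 <- !str_detect(words, "[aeiou]")
# Find all words consisting only of consonants (non-vowels)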
no_vowels_2 <- str_detect(words, "^[^aeiou]+$")
identical(no_vowels_1, no_vowels_2)
#> [1] TRUE

str_subset()

# A common use of str_detect() is to select the elements that match a pattern. You can do this with logical subsetting or with the convenient str_subset() wrapper
# x[...] selects elements of a vector
words[str_detect(words, "x$")]
#> [1] "box" "sex" "six" "tax"
str_subset(words, "x$")
#> [1] "box" "sex" "six" "tax"

# When the strings are a column of a data frame, use filter() instead
df <- tibble(
  word = words,
  i = seq_along(word)
)
df %>%
  filter(str_detect(word, "x$"))
# # A tibble: 4 x 2
# word      i
# <chr> <int>
# 1 box     108
# 2 sex     747
# 3 six     772
# 4 tax     841
# Warning message:
#   `...` is not empty.

str_count()
A variation on str_detect(): rather than a simple yes or no, it returns the number of matches in each string:

x <- c("apple", "banana", "pear")
str_count(x, "a")
#> [1] 1 3 1
# On average, how many vowels per word?
mean(str_count(words, "[aeiou]"))
#> [1] 1.99
# str_count() also works naturally with mutate():
df %>%
  mutate(
    vowels = str_count(word, "[aeiou]"),
    consonants = str_count(word, "[^aeiou]")
  )

Many stringr functions come in pairs: one works on a single match and the other, with the suffix _all, works on all matches.

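4.2 Exercises

For each of the following challenges, try solving it both with a single regular expression and with a combination of multiple str_detect() calls.
a. Find all words that start or end with x.
b. Find all words that start with a vowel and end with a consonant.
c. Are there any words that contain at least one of each different vowel?
d. What word has the highest number of vowels? What word has the highest proportion of vowels? (Hint: what is the denominator?)

# Starts or ends with x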
words[str_detect(words,"(^x(.*))|((.*)x$)")]
str_subset(words,"(^x(.*))|((.*)x$)")

# Starts with a vowel and ends with a consonant
words[str_detect(words,"^[aoiue](.*)[^aoiue]$")]
words[(str_detect(words,"^[aoiue](.*)"))&(str_detect(words,"(.*)[^aoiue]$"))]
str_subset(words,"^[aoiue](.*)[^aoiue]$")

# Words that contain every vowel
words[str_detect(words, "a") &
        str_detect(words, "e") &
        str_detect(words, "i") &
        str_detect(words, "o") &
        str_detect(words, "u")
      ]
# character(0)

# Words with the highest number of vowels
words[
  which(
    str_count(words, "[aeiou]") == max(str_count(words, "[aeiou]"))
    )
  ]
# words[str_count(words, "[aeiou]") == max(str_count(words, "[aeiou]"))]
# [1] "appropriate" "associate"   "available"   "colleague"   "encourage"  
# [6] "experience"  "individual"  "television"
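
The second half of part d (highest proportion of vowels) is not answered above. Following the hint, the denominator is the word length. A minimal sketch on a small invented vector; the helper name `vowel_share` and the vector `demo` are illustrative, not from the original text.

```r
library(stringr)

# Share of characters in each word that are vowels
vowel_share <- function(w) str_count(w, "[aeiou]") / str_length(w)

# Hypothetical sample; the same call works on stringr::words
demo   <- c("a", "area", "rhythm", "idea")
shares <- vowel_share(demo)
top    <- demo[shares == max(shares)]
top

# On the full corpus:
# words[vowel_share(words) == max(vowel_share(words))]
```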

4.3 Extract matches

To extract the actual text of a match, use str_extract().

stringr::sentences
length(sentences)
# [1] 720
head(sentences)
# [1] "The birch canoe slid on the smooth planks." 
# [2] "Glue the sheet to the dark blue background."
# [3] "It's easy to tell the depth of a well."     
# [4] "These days a chicken leg is a rare dish."   
# [5] "Rice is often served in round bowls."       
# [6] "The juice of lemons makes fine punch."

# Create a vector of colour names, then turn it into a single regular expression:

colors <- c("red", "orange", "yellow", "green", "blue", "purple")
color_match <- str_c(colors, collapse = "|")
color_match
# [1] "red|orange|yellow|green|blue|purple"
# Select the sentences that contain a colour, then extract the colour
has_color <- str_subset(sentences, color_match)
matches <- str_extract(has_color, color_match)
head(matches)
#> [1] "blue" "blue" "red" "red" "red" "blue"

more <- sentences[str_count(sentences,color_match)>1]
str_view_all(more,color_match)
str_extract(more,color_match)
# [1] "blue"   "green"  "orange"

str_extract_all(more,color_match)
# [[1]]
# [1] "blue" "red" 
# 
# [[2]]
# [1] "green" "red"  
# 
# [[3]]
# [1] "orange" "red"

# With simplify = TRUE, str_extract_all() returns a matrix in which short matches are expanded to the same length as the longest:
str_extract_all(more, color_match, simplify = TRUE)
#> [,1] [,2]
#> [1,] "blue" "red"
#> [2,] "green" "red"
#> [3,] "orange" "red"
x <- c("a", "a b", "a b c")

str_extract_all(x, "[a-z]", simplify = TRUE)
#> [,1] [,2] [,3]
#> [1,] "a" "" ""
#> [2,] "a" "b" ""
#> [3,] "a" "b" "c"

4.4 Exercises

4.4.1 In the previous example, you might have noticed that the regular expression matched "flickered", which is not a colour. Modify the regex to fix the problem.

colors <- c(
  "red", "orange", "yellow", "green", "blue", "purple"
)
color_match <- str_c(colors, collapse = "|")
color_match
# [1] "red|orange|yellow|green|blue|purple"
# Fix
## \b matches a word boundary, so wrapping the alternatives in \b(...)\b matches whole words only
color_match2 <- str_c("\\b(", str_c(colors, collapse = "|"), ")\\b")
color_match2
# [1] "\\b(red|orange|yellow|green|blue|purple)\\b"
more2 <- sentences[str_count(sentences, color_match2) > 1]
str_view_all(more2, color_match2, match = TRUE)

4.4.2 From the Harvard sentences data, extract:

a. The first word from each sentence.
b. All words ending in ing.
c. All plurals.

# The first word of each sentence
str_extract(sentences, "[A-Za-z]+") %>% 
head()
# [1] "The"   "Glue"  "It"    "These" "Rice"  "The" 

# All words ending in ing
pattern <- "\\b[A-Za-z]+ing\\b"
sentences_with_ing <- str_detect(sentences, pattern)
unique(unlist(str_extract_all(sentences[sentences_with_ing], pattern)))

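4.5 Grouped matches

Parentheses in a regular expression clarify precedence and create groups. Groups can be used for backreferences while matching, and also to extract the parts of a complex match.

# Find all words that come after a or the
# Defining a "word" in a regular expression is a little tricky, so use a simple approximation: a sequence of at least one character that isn't a space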
noun <- "(a|the) ([^ ]+)"
has_noun <- sentences %>%
str_subset(noun) %>%
head(10)
has_noun %>%
str_extract(noun)
#> [1] "the smooth" "the sheet" "the depth" "a chicken"
#> [5] "the parked" "the sun" "the huge" "the ball"
#> [9] "the woman" "a helps"

str_extract() gives the complete match;
str_match() gives each individual group. Instead of a character vector, str_match() returns a matrix with one column for the complete match followed by one column for each group:

str_match(has_noun, noun)
# or has_noun %>% str_match(noun)
#       [,1]         [,2]  [,3]     
# [1,] "the smooth" "the" "smooth" 
# [2,] "the sheet"  "the" "sheet"  
# [3,] "the depth"  "the" "depth"  
# [4,] "a chicken"  "a"   "chicken"
# [5,] "the parked" "the" "parked" 
# [6,] "the sun"    "the" "sun"    
# [7,] "the huge"   "the" "huge"   
# [8,] "the ball"   "the" "ball"   
# [9,] "the woman"  "the" "woman"  
# [10,] "a helps"    "a"   "helps"

If your data is in a tibble, it is often easier to use tidyr::extract(). It works like str_match() but requires you to name each group, and the groups are then placed in new columns of the tibble:

tibble(sentence = sentences) %>%
tidyr::extract(
sentence, c("article", "noun"), "(a|the) ([^ ]+)",
remove = FALSE
)
#> # A tibble: 720 × 3
#> sentence article noun
#> * <chr> <chr> <chr>
#> 1 The birch canoe slid on the smooth planks. the smooth
#> 2 Glue the sheet to the dark blue background. the sheet
#> 3 It's easy to tell the depth of a well. the depth
#> 4 These days a chicken leg is a rare dish. a chicken
#> 5 Rice is often served in round bowls. <NA> <NA>
#> 6 The juice of lemons makes fine punch. <NA> <NA>
#> # ... with 714 more rows

As with str_extract(), if you want all matches for each string, you need str_match_all().
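
The difference between the two functions can be seen on a single sentence that contains two matches. A brief sketch, reusing the `noun` pattern from above on one of the Harvard sentences:

```r
library(stringr)

noun <- "(a|the) ([^ ]+)"
s <- "It's easy to tell the depth of a well."

# str_match(): one row per input string, first match only
first_only <- str_match(s, noun)

# str_match_all(): a list with one matrix per input string, one row per match
all_matches <- str_match_all(s, noun)[[1]]

first_only
all_matches
```

Because the approximate definition of a word is "anything that isn't a space", the second match keeps the trailing full stop.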

4.6 Exercises

4.6.1 Find all words that come after a "number" like "one", "two", "three" etc. Pull out both the number and the word.

# \b: word boundary
# \w: any word character
# \W: any non-word character

numword <- "\\b(one|two|three|four|five|six|seven|eight|nine|ten) +(\\w+)"
sentences[str_detect(sentences, numword)] %>%  str_extract(numword)

4.6.2 Find all contractions. Separate out the pieces before and after the apostrophe.

contraction <- "([A-Za-z]+)'([A-Za-z]+)"
sentences[str_detect(sentences, contraction)] %>%
  str_extract(contraction) %>%
  str_split("'")

4.7 Replacing matches

str_replace() and str_replace_all() replace matches with new strings. The simplest use is to replace a pattern with a fixed string:

x <- c("apple", "pear", "banana")
str_replace(x, "[aeiou]", "-")
#> [1] "-pple" "p-ar" "b-nana"
str_replace_all(x, "[aeiou]", "-")
#> [1] "-ppl-" "p--r" "b-n-n-"

With str_replace_all() you can perform multiple replacements at once by supplying a named vector:

x <- c("1 house", "2 cars", "3 people")
str_replace_all(x, c("1" = "one", "2" = "two", "3" = "three"))
#> [1] "one house" "two cars" "three people"

Instead of replacing with a fixed string, you can use backreferences to insert components of the match. The following code swaps the order of the second and third words:

sentences %>%
str_replace("([^ ]+) ([^ ]+) ([^ ]+)", "\\1 \\3 \\2") %>%
head(5)
#> [1] "The canoe birch slid on the smooth planks."
#> [2] "Glue sheet the to the dark blue background."
#> [3] "It's to easy tell the depth of a well."
#> [4] "These a days chicken leg is a rare dish."
#> [5] "Rice often is served in round bowls."

4.8 Exercises

4.8.1 Replace all forward slashes in a string with backslashes.

x <- c("a/b","a/b/c")
(str_replace_all(x,"/","\\\\"))
# [1] "a\\b"    "a\\b\\c"
writeLines(str_replace_all(x,"/","\\\\"))
# a\b
# a\b\c

4.8.2 Implement a simple version of str_to_lower() using str_replace_all().

LETTERS2letters <- letters
names(LETTERS2letters) <- LETTERS
str_replace_all(words, LETTERS2letters)

4.8.3 Switch the first and last letters in words. Which of those strings are still words?
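
No solution is given for this exercise in the source. One possible approach, sketched here on a small invented vector, swaps the two letters with backreferences and then keeps the results that also appear in the input; `swap_ends` and `demo` are hypothetical names.

```r
library(stringr)

# Swap the first and last letters; strings shorter than two letters are left unchanged
swap_ends <- function(w) {
  str_replace(w, "^([A-Za-z])(.*)([A-Za-z])$", "\\3\\2\\1")
}

# Hypothetical sample; replace with stringr::words for the real exercise
demo    <- c("dog", "god", "a", "level", "apple")
swapped <- swap_ends(demo)
still_words <- intersect(swapped, demo)
still_words

# On the full corpus:
# intersect(swap_ends(words), words)
```

Words that start and end with the same letter, and single-letter words, trivially remain words after the swap.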

4.9 Splitting

str_split() splits a string into pieces.

# Split sentences into words
# Each component may contain a different number of pieces, so str_split() returns a list
y <- str_split(head(sentences,5)," ")
# [[1]]
# [1] "The"     "birch"   "canoe"   "slid"    "on"      "the"     "smooth" 
# [8] "planks."
# 
# [[2]]
# [1] "Glue"        "the"         "sheet"       "to"          "the"        
# [6] "dark"        "blue"        "background."
# 
# [[3]]
# [1] "It's"  "easy"  "to"    "tell"  "the"   "depth" "of"    "a"     "well."
# 
# [[4]]
# [1] "These"   "days"    "a"       "chicken" "leg"     "is"      "a"      
# [8] "rare"    "dish."  
# 
# [[5]]
# [1] "Rice"   "is"     "often"  "served" "in"     "round"  "bowls."
y[[1]][1]
# [1] "The"

# If you are splitting a length-1 vector, simply extract the first element of the list:
"a|b|c|d" %>%
str_split("\\|") %>%
.[[1]]
#> [1] "a" "b" "c" "d"

# Set simplify = TRUE to return a matrix
str_split(head(sentences,5)," ", simplify = TRUE)
#      [,1]    [,2]    [,3]    [,4]      [,5]  [,6]    [,7]     [,8]          [,9]   
# [1,] "The"   "birch" "canoe" "slid"    "on"  "the"   "smooth" "planks."     ""     
# [2,] "Glue"  "the"   "sheet" "to"      "the" "dark"  "blue"   "background." ""     
# [3,] "It's"  "easy"  "to"    "tell"    "the" "depth" "of"     "a"           "well."
# [4,] "These" "days"  "a"     "chicken" "leg" "is"    "a"      "rare"        "dish."
# [5,] "Rice"  "is"    "often" "served"  "in"  "round" "bowls." ""   



# 還可以設(shè)定拆分片段的最大數(shù)量
str_split(head(sentences,5)," ", n=2,simplify = TRUE)
#      [,1]    [,2]                                    
# [1,] "The"   "birch canoe slid on the smooth planks."
# [2,] "Glue"  "the sheet to the dark blue background."
# [3,] "It's"  "easy to tell the depth of a well."     
# [4,] "These" "days a chicken leg is a rare dish."    
# [5,] "Rice"  "is often served in round bowls." 




# Instead of a pattern, you can also split by character, line, sentence and word boundaries with boundary()
# boundary(type = c("character", "line_break", "sentence", "word"),  skip_word_none = NA, ...)
str_split(head(sentences,5),boundary("word"))
# [[1]]
# [1] "The"    "birch"  "canoe"  "slid"   "on"     "the"    "smooth" "planks"
# 
# [[2]]
# [1] "Glue"       "the"        "sheet"      "to"         "the"        "dark"       "blue"      
# [8] "background"
# 
# [[3]]
# [1] "It's"  "easy"  "to"    "tell"  "the"   "depth" "of"    "a"     "well" 
# 
# [[4]]
# [1] "These"   "days"    "a"       "chicken" "leg"     "is"      "a"       "rare"    "dish"   
# 
# [[5]]
# [1] "Rice"   "is"     "often"  "served" "in"     "round"  "bowls" 
str_split(head(sentences,5),boundary("word"),simplify = TRUE)
#      [,1]    [,2]    [,3]    [,4]      [,5]  [,6]    [,7]     [,8]         [,9]  
# [1,] "The"   "birch" "canoe" "slid"    "on"  "the"   "smooth" "planks"     ""    
# [2,] "Glue"  "the"   "sheet" "to"      "the" "dark"  "blue"   "background" ""    
# [3,] "It's"  "easy"  "to"    "tell"    "the" "depth" "of"     "a"          "well"
# [4,] "These" "days"  "a"     "chicken" "leg" "is"    "a"      "rare"       "dish"
# [5,] "Rice"  "is"    "often" "served"  "in"  "round" "bowls"  ""           ""    
str_split(head(sentences,5),boundary("word"),n=2,simplify = TRUE)
#     [,1]    [,2]                                    
# [1,] "The"   "birch canoe slid on the smooth planks."
# [2,] "Glue"  "the sheet to the dark blue background."
# [3,] "It's"  "easy to tell the depth of a well."     
# [4,] "These" "days a chicken leg is a rare dish."    
# [5,] "Rice"  "is often served in round bowls."

sentences %>%
  head(2) %>%
  str_split(boundary("character"))
# [[1]]
# [1] "T" "h" "e" " " "b" "i" "r" "c" "h" " " "c" "a" "n" "o" "e" " " "s" "l" "i" "d" " " "o" "n" " " "t" "h" "e" " " "s" "m" "o" "o" "t"
# [34] "h" " " "p" "l" "a" "n" "k" "s" "."
# 
# [[2]]
# [1] "G" "l" "u" "e" " " "t" "h" "e" " " "s" "h" "e" "e" "t" " " "t" "o" " " "t" "h" "e" " " "d" "a" "r" "k" " " "b" "l" "u" "e" " " "b"
# [34] "a" "c" "k" "g" "r" "o" "u" "n" "d" "."
sentences %>%
  head(5) %>%
  str_split(boundary("line_break"))
# [[1]]
# [1] "The "    "birch "  "canoe "  "slid "   "on "     "the "    "smooth " "planks."
# 
# [[2]]
# [1] "Glue "       "the "        "sheet "      "to "         "the "        "dark "       "blue "       "background."
# 
# [[3]]
# [1] "It's "  "easy "  "to "    "tell "  "the "   "depth " "of "    "a "     "well." 
# 
# [[4]]
# [1] "These "   "days "    "a "       "chicken " "leg "     "is "      "a "       "rare "    "dish."   
# 
# [[5]]
# [1] "Rice "   "is "     "often "  "served " "in "     "round "  "bowls." 
sentences %>%
  head(5) %>%
  str_split(boundary("sentence"))
# [[1]]
# [1] "The birch canoe slid on the smooth planks."
# 
# [[2]]
# [1] "Glue the sheet to the dark blue background."
# 
# [[3]]
# [1] "It's easy to tell the depth of a well."
# 
# [[4]]
# [1] "These days a chicken leg is a rare dish."
# 
# [[5]]
# [1] "Rice is often served in round bowls."

4.10 Exercises

4.10.1 Split up the string "apples, pears, and bananas".

"apples, pears, and bananas" %>%
  str_split(boundary("character"))
# [[1]]
# [1] "a" "p" "p" "l" "e" "s" "," " " "p" "e" "a" "r" "s" "," " " "a" "n" "d" " " "b" "a" "n" "a" "n" "a" "s"

"apples, pears, and bananas" %>%
  str_split(boundary("word"))
# [[1]]
# [1] "apples"  "pears"   "and"     "bananas"

"apples, pears, and bananas" %>%
  str_split(boundary("sentence"))
# [[1]]
# [1] "apples, pears, and bananas"

4.10.2 Why is it better to split by boundary("word") than by " "?

# Splitting on a space leaves the punctuation attached to the words
"apples, pears, and bananas" %>%
  str_split(" ")
# [[1]]
# [1] "apples," "pears,"  "and"     "bananas"

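4.10.3 What does splitting with an empty string ("") do? Experiment, and then read the documentation.

# Splitting with an empty string ("") splits the string into individual characters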
"apples, pears, and bananas" %>%
  str_split("")
# [[1]]
# [1] "a" "p" "p" "l" "e" "s" "," " " "p" "e" "a" "r" "s" "," " " "a" "n" "d" " " "b" "a" "n" "a" "n" "a" "s"

4.11 Locating matches

str_locate() and str_locate_all() give the start and end position of each match.
Use str_locate() to find the matching pattern, then str_sub() to extract or modify the matched text.

head(sentences,5)
# [1] "The birch canoe slid on the smooth planks."  "Glue the sheet to the dark blue background."
# [3] "It's easy to tell the depth of a well."      "These days a chicken leg is a rare dish."   
# [5] "Rice is often served in round bowls."       
str_locate(head(sentences,5),"days")
#       start end
# [1,]    NA  NA
# [2,]    NA  NA
# [3,]    NA  NA
# [4,]     7  10
# [5,]    NA  NA
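
The combination described above, locating a match and then passing its positions to str_sub(), can be sketched as follows. The two sentences come from the output above; the variable names are introduced here for illustration.

```r
library(stringr)

x <- c("These days a chicken leg is a rare dish.",
       "Rice is often served in round bowls.")

# One row per string, with "start" and "end" columns; NA where nothing matches
loc <- str_locate(x, "days")

# Extract the matched text using the located positions
extracted <- str_sub(x, loc[, "start"], loc[, "end"])

# Modify the matched text in a copy of the first sentence
y <- x
str_sub(y[1], loc[1, "start"], loc[1, "end"]) <- "DAYS"
y[1]
```

Strings without a match yield NA positions, and str_sub() returns NA for them.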

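5 Other types of pattern

When you use a string as a pattern, it is automatically wrapped in a call to regex():

The usual call:

str_view(fruit, "nana")

is shorthand for

str_view(fruit, regex("nana"))
You can use the other arguments of regex() to control the details of the match.

ignore_case = TRUE allows characters to match either their uppercase or lowercase forms. This always uses the current locale: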

bananas <- c("banana", "Banana", "BANANA")
str_view(bananas, "banana")
str_view(bananas, regex("banana", ignore_case = TRUE))

multiline = TRUE allows ^ and $ to match the start and end of each line rather than the start and end of the complete string:

x <- "Line 1\nLine 2\nLine 3"
str_extract_all(x, "^Line")[[1]]
#> [1] "Line"
str_extract_all(x, regex("^Line", multiline = TRUE))[[1]]
#> [1] "Line" "Line" "Line"

comments = TRUE lets you use comments and whitespace to make complex regular expressions easier to understand. Spaces and everything after # are ignored. To match a literal space, you need to escape it: "\\ ":

phone <- regex("
\\(? # optional opening parens
(\\d{3}) # area code
[)- ]? # optional closing parens, dash, or space
(\\d{3}) # another three numbers
[ -]? # optional space or dash
(\\d{3}) # three more numbers
", comments = TRUE)
str_match("514-791-8141", phone)
#> [,1] [,2] [,3] [,4]
#> [1,] "514-791-814" "514" "791" "814"
dotall = TRUE allows . to match everything, including \n.
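
The dotall option has no example in the text. A minimal sketch, with a made-up two-line string:

```r
library(stringr)

# Hypothetical input containing a newline
x <- "first\nsecond"

# By default "." does not match a newline, so there is no match
default_dot <- str_extract(x, "first.second")

# With dotall = TRUE the "." also matches "\n"
dotall_dot <- str_extract(x, regex("first.second", dotall = TRUE))

is.na(default_dot)
identical(dotall_dot, x)
```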

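5.1 Besides regex(), there are three other functions you can use.

fixed() matches exactly the specified sequence of bytes. It ignores all special regular expression characters and operates at a very low level. This lets you avoid complex escaping and can be much faster than regular expressions. The following microbenchmark shows that it is about 3 times faster for this simple example: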

microbenchmark::microbenchmark(
fixed = str_detect(sentences, fixed("the")),
regex = str_detect(sentences, "the"),
times = 20
)
#> Unit: microseconds
#> expr min lq mean median uq max neval cld
#> fixed 116 117 136 120 125 389 20 a
#> regex 333 337 346 338 342 467 20 b

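Be careful when using fixed() with non-English data. It is problematic because there are often several ways of representing the same character. For example, there are two ways to define á: as a single character, or as an a plus an accent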

a1 <- "\u00e1"
a2 <- "a\u0301"
c(a1, a2)
#> [1] "á" "á"
a1 == a2
#> [1] FALSE

The two render identically, but because they are defined differently, fixed() does not find a match. Instead, you can use coll(), described next, which respects human character comparison rules:

str_detect(a1, fixed(a2))
#> [1] FALSE
str_detect(a1, coll(a2))
#> [1] TRUE

coll() compares strings using standard collation rules, which is useful for case-insensitive matching. Note that coll() takes a locale argument that controls which rules are used for comparing characters. Unfortunately, different parts of the world use different rules!

# That means you also need to be aware of the difference when doing case-insensitive matches:
i <- c("I", "İ", "i", "ı")
i
#> [1] "I" "İ" "i" "ı"
str_subset(i, coll("i", ignore_case = TRUE))
#> [1] "I" "i"
str_subset(
  i,
  coll("i", ignore_case = TRUE, locale = "tr")
)
#> [1] "İ" "i"

Both fixed() and regex() have an ignore_case argument, but neither lets you pick the locale: they always use the default locale. You can see what that is with the following code (more on stringi later):

stringi::stri_locale_info()
#> $Language
#> [1] "en"
#>
#> $Country
#> [1] "US"
#>
#> $Variant
#> [1] ""
#>
#> $Name
#> [1] "en_US"

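The downside of coll() is speed: because the rules for recognising which characters are the same are complicated, coll() is relatively slow compared with regex() and fixed().

As you saw with str_split(), you can use boundary() to match boundaries. You can also use it with the other functions: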

x <- "This is a sentence."
str_view_all(x, boundary("word"))
str_extract_all(x, boundary("word"))
#> [[1]]
#> [1] "This" "is" "a" "sentence"

6 Other uses of regular expressions

Two useful functions in base R also use regular expressions.
- apropos() searches all objects available from the global environment. This is useful if you can't quite remember the name of a function:

apropos("replace")
#> [1] "%+replace%" "replace" "replace_na"
#> [4] "str_replace" "str_replace_all" "str_replace_na"
#> [7] "theme_replace"

- dir() lists all the files in a directory. Its pattern argument takes a regular expression and returns only the file names that match it. For example, you can find all the R Markdown files in the current directory with:

head(dir(pattern = "\\.Rmd$"))
#> [1] "communicate-plots.Rmd" "communicate.Rmd"
#> [3] "datetimes.Rmd" "EDA.Rmd"
#> [5] "explore.Rmd" "factors.Rmd"

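7 stringi

stringr is built on top of stringi. stringr is easy to learn because it exposes a minimal set of carefully chosen functions that handle the most common string manipulation tasks. stringi, by contrast, is designed to be comprehensive and contains almost every function you might ever need: stringi has 234 functions, against 42 in stringr.

If you find yourself struggling to do something in stringr, it is worth taking a look at stringi. The two packages work very similarly, so you should be able to carry over your stringr knowledge naturally. The main difference is the prefix: str_ vs. stri_.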


Rdata006: Processing strings with stringr
