1 Preparation
- Load packages
library(tidyverse)
library(stringr)
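1.1 Introduction to stringr
The stringr package is designed as a consistent, simple, easy-to-use set of string tools. All functions and arguments are defined consistently; for example, NA values and zero-length vectors are handled the same way everywhere.
String processing is not R's main strength, but it is indispensable: data cleaning, visualization and similar tasks all rely on it. The string functions in base R have accumulated inconsistencies over time, with irregular naming and non-standard argument definitions, so they are hard to pick up at a glance. String handling is convenient in most other languages, and R has lagged behind in this respect. stringr was written to address this by providing a friendly, easy-to-use interface for string operations.
stringr project page: https://cran.r-project.org/web/packages/stringr/index.html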
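1.2 Function categories in stringr
1.2.1 String concatenation functions
str_c: concatenate strings.
str_join: concatenate strings; same as str_c (an alias found only in older stringr releases).
str_trim: remove leading and trailing spaces and tabs (\t)
str_pad: pad a string to a given width
str_dup: duplicate a string
str_wrap: wrap a string into formatted paragraphs
str_sub: extract a substring
str_sub<-: replace a substring by assignment; positions work as in str_sub
1.2.2 String computation functions
str_count: count the matches in a string
str_length: string length
str_sort: sort strings by value
str_order: return the sorting order as indices; same rules as str_sort
1.2.3 String matching functions
str_split: split a string
str_split_fixed: split a string into a fixed number of pieces; otherwise like str_split
str_subset: return the strings that match
word: extract words from a sentence
str_detect: test whether a string matches a pattern
str_match: extract the matched groups of the first match.
str_match_all: extract the matched groups of every match; otherwise like str_match
str_replace: replace the first match
str_replace_all: replace all matches; otherwise like str_replace
str_replace_na: turn NA into the string "NA"
str_locate: find the position of the first match.
str_locate_all: find the positions of all matches; otherwise like str_locate
str_extract: extract the first match from a string
str_extract_all: extract all matches; otherwise like str_extract
1.2.4 String transformation functions
str_conv: convert character encoding
str_to_upper: convert to upper case
str_to_lower: convert to lower case; same rules as str_to_upper
str_to_title: convert to title case; same rules as str_to_upper
1.2.5 Pattern-control functions (used only to build arguments for other functions; they cannot be used on their own)
boundary: match boundaries
coll: compare strings using standard collation rules.
fixed: match the pattern as a literal string, including characters that are special in regular expressions
regex: define a regular expression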
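1.3 Important functions in stringr
Function | Description | Base R equivalent |
---|---|---|
Functions that use regular expressions | ||
str_extract() | Extract the first match of a pattern | regmatches() |
str_extract_all() | Extract all matches of a pattern | regmatches() |
str_locate() | Return the position of the first match | regexpr() |
str_locate_all() | Return the positions of all matches | gregexpr() |
str_replace() | Replace the first match | sub() |
str_replace_all() | Replace all matches | gsub() |
str_split() | Split a string at each match of a pattern | strsplit() |
str_split_fixed() | Split a string into a fixed number of pieces | - |
str_detect() | Test whether a string contains a pattern | grepl() |
str_count() | Count the occurrences of a pattern | - |
Other important functions | ||
str_sub() | Extract the characters at given positions | substr() |
str_dup() | Duplicate a string a given number of times | strrep() |
str_length() | Return the number of characters | nchar() |
str_pad() | Pad a string | - |
str_trim() | Remove padding, such as leading and trailing spaces | trimws() |
str_c() | Concatenate strings | paste(), paste0() |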
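1.4 Special characters
- The characters . ^ $ * + ? [ ] ( ) { } | and \ have a special meaning in regular expressions; to match one of them literally, escape it with \
- Use ?'"' or ?"'" to open the help page that lists every string escape sequence. To match a newline, use the regular expression "\n"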
?"'"
?'"'
# \n newline
# \r carriage return
# \t tab
# \b backspace
# \a alert (bell)
# \f form feed
# \v vertical tab
# \\ backslash \
# \' ASCII apostrophe '
# \" ASCII quotation mark "
# \` ASCII grave accent (backtick) `
# \nnn character with given octal code (1, 2 or 3 digits)
# \xnn character with given hex code (1 or 2 hex digits)
# \unnnn Unicode character with given code (1--4 hex digits)
# \Unnnnnnnn Unicode character with given code (1--8 hex digits)
2 String basics
# Strings can be created with either single or double quotes. Unlike in some other languages, the two behave identically in R. We recommend using "
string1 <- "This is a string"
string2 <- 'To put a "quote" inside a string, use single quotes'
# If you forget to close a quote, you will see the continuation prompt `+`. If that happens, press `Esc` and type the line again
# To include a literal single or double quote in a string, "escape" it with `\`:
(double_quote <- "\"" )
# [1] "\""
('"')
# [1] "\""
(single_quote <- '\'')
# [1] "'"
("'")
# [1] "'"
# To include a literal backslash in a string, double it: \\
(x <- c("\"", "\\"))
# [1] "\"" "\\"
writeLines(x)
# "
# \
# The printed representation of a string is not the same as the string itself, because **the printed form shows the escapes**. To see the raw contents of a string, use `writeLines()`
x <- "\u00b5"
x
# [1] "μ"
2.1 String length: str_length()
str_length(c("a", "R for data science", NA))
# [1] 1 18 NA
2.3 Combining strings: str_c()
# To combine two or more strings, use str_c()
str_c("x", "y")
# [1] "xy"
str_c("x", "y", "z")
# [1] "xyz"
# Use the sep argument to control how the strings are separated:
str_c("x", "y", sep = "_")
# [1] "x_y"
# As in most R functions, missing values are contagious. To print them as "NA" instead, use str_replace_na():
x <- c("abc", NA)
x
# [1] "abc" NA
str_c("|-", x, "-|")
# [1] "|-abc-|" NA
str_c("|-", str_replace_na(x), "-|")
# [1] "|-abc-|" "|-NA-|"
# str_c() is vectorised: it automatically recycles shorter vectors to the length of the longest one
str_c("prefix-", c("a", "b", "c"), "-suffix")
# [1] "prefix-a-suffix" "prefix-b-suffix" "prefix-c-suffix"
# To collapse a character vector into a single string, use the collapse argument:
str_c(c("x", "y", "z"), collapse = ", ")
# [1] "x, y, z"
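The sep and collapse arguments can be combined in one call. The sketch below is a minimal illustration; the vectors `first` and `second` are made up for this example and do not come from the original text.

```r
library(stringr)

# Hypothetical example vectors
first  <- c("a", "b", "c")
second <- c("1", "2", "3")

# sep is placed between the corresponding elements of each argument
pairs <- str_c(first, second, sep = "-")
pairs
# [1] "a-1" "b-2" "c-3"

# collapse then joins the resulting vector into a single string
joined <- str_c(first, second, sep = "-", collapse = ", ")
joined
# [1] "a-1, b-2, c-3"
```

In other words, sep works across arguments and collapse works along the result.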
2.4 Subsetting strings: str_sub()
You can extract part of a string with str_sub(). Besides the string, str_sub() takes start and end arguments that give the (inclusive) positions of the substring:
x <- c("Apple", "Banana", "Pear")
str_sub(x, 1, 3)
# [1] "App" "Ban" "Pea"
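# Negative numbers count backwards from the end
str_sub(x, -3, -1)
# [1] "ple" "ana" "ear"
# str_sub() does not fail if the string is too short; it returns as many characters as possible:
str_sub("a", 1, 5)
# [1] "a"
# You can also use the assignment form of str_sub() to modify strings: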
str_sub(x, 1, 1) <- str_to_lower(str_sub(x, 1, 1))
x
#> [1] "apple" "banana" "pear"
2.5 Locales
# Case conversion: str_to_lower(), str_to_upper(), str_to_title()
# Changing case is more complicated than it first appears, because different languages have different rules
# Turkish has two i's, with and without a dot, and each has its own upper-case form:
str_to_upper(c("i", "ı"))
#> [1] "I" "I"
str_to_upper(c("i", "ı"), locale = "tr")
#> [1] "İ" "I"
x <- c("apple", "eggplant", "banana")
# English
str_sort(x, locale = "en")
#> [1] "apple" "banana" "eggplant"
# Hawaiian
str_sort(x, locale = "haw")
#> [1] "apple" "eggplant" "banana"
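2.6 Exercises
2.6.1 In code that doesn't use stringr, you'll often see paste() and paste0(). What's the difference between the two functions? Which stringr functions are they equivalent to? How do the functions differ in their handling of NA?
Usage
paste (..., sep = " ", collapse = NULL)
paste0(..., collapse = NULL)
Both correspond to str_c(): paste0() joins with no separator, like the default str_c(), while paste() behaves like str_c(sep = " "). Examples:
# sep: separator placed between the strings passed as separate arguments
# collapse: separator used to join the elements of the resulting vector into one string
# paste() joins with a space " " by default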
paste("abc","def","ghi")
# [1] "abc def ghi"
paste("abc","def","ghi",sep = ".")
# [1] "abc.def.ghi"
# collapse has no effect here, because every argument has length 1
paste("abc","def","ghi",collapse = ".")
# [1] "abc def ghi"
paste(c("abc","def","ghi"),collapse = ".")
# [1] "abc.def.ghi"
# paste0() joins with no separator by default
paste0("abc","def","ghi")
# [1] "abcdefghi"
# paste0() has no sep argument, so sep = "." is treated as one more string to append
paste0("abc","def","ghi",sep = ".")
# [1] "abcdefghi."
# collapse has no effect on length-1 arguments
paste0("abc","def","ghi",collapse = ".")
# [1] "abcdefghi"
# collapse joins the elements of a vector
paste0(c("abc","def","ghi"),collapse = ".")
# [1] "abc.def.ghi"
# NA is contagious in str_c(); paste() and paste0() convert it to the string "NA" instead
paste("abc","def","ghi",NA)
# [1] "abc def ghi NA"
paste0("abc","def","ghi",NA)
# [1] "abcdefghiNA"
str_c("abc","def","ghi",NA)
# [1] NA
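2.6.2 In your own words, describe the difference between the sep and collapse arguments to str_c().
# sep: separator placed between the strings passed as separate arguments
# collapse: separator used to join the elements of the resulting vector into one string
# str_c() joins with no separator by default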
str_c("abc","def","ghi",sep = ".")
# [1] "abc.def.ghi"
str_c("abc","def","ghi",collapse = ".")
# [1] "abcdefghi"
str_c(c("abc","def","ghi"),sep = ".")
# [1] "abc" "def" "ghi"
str_c(c("abc","def","ghi"),collapse = ".")
# [1] "abc.def.ghi"
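2.6.3 Use str_length() and str_sub() to extract the middle character from a string. What will you do if the string has an even number of characters?
# floor: round down, i.e. the largest integer not greater than the number
# ceiling: round up, i.e. the smallest integer not less than the number
# trunc: keep the integer part
# round: round to a given number of decimal places
# signif: round to a given number of significant digits, common in scientific work
x <- c("a", "abc", "abcd", "abcde", "abcdef")
# String lengths
(len <- str_length(x))
(m <- ceiling(len/2))
(n <- floor(len/2))
# Use the modulo operator "%%" to test whether the length is odd or even; for even lengths, return the two middle characters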
ifelse (len%%2 !=0,str_sub(x, m, m),str_sub(x,n,n+1))
# if_else (len%%2 !=0,str_sub(x, m, m),str_sub(x,n,n+1))
# [1] "a" "b" "bc" "c" "cd"
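2.6.4 What does str_wrap() do? When might you want to use it?
- It reflows text into lines of a target width, which is useful when printing long strings
str_wrap(string, width = 80, indent = 0, exdent = 0)
Argument | Description |
---|---|
string | character vector of strings to reformat |
width | target line width in characters |
indent | indentation of the first line of each paragraph |
exdent | indentation of the following lines of each paragraph |
Value | a character vector of reformatted strings |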
thanks <- str_c(readLines(R.home("doc/THANKS")), collapse = "\n")
thanks <- word(thanks, 1, 3, fixed("\n\n"))
cat(str_wrap(thanks), "\n")
cat(str_wrap(thanks, width = 40), "\n")
cat(str_wrap(thanks, width = 60, indent = 2), "\n")
cat(str_wrap(thanks, width = 60, exdent = 2), "\n")
2.6.5 What does str_trim() do? What's the opposite of str_trim()?
# str_trim() removes leading and trailing whitespace
str_trim(" ab cd ")
# [1] "ab cd"
str_trim(" ab cd ","both")
# [1] "ab cd"
str_trim(" ab cd ","left")
# [1] "ab cd "
str_trim(" ab cd ","right")
# [1] " ab cd"
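The opposite of str_trim() is str_pad(), which adds whitespace instead of removing it. As a minimal sketch (the input "ab" and the width of 6 are arbitrary choices for illustration), a round trip through both functions returns the original string:

```r
library(stringr)

# Pad "ab" to a total width of 6, adding spaces on both sides
padded <- str_pad("ab", width = 6, side = "both")
padded
# [1] "  ab  "

# Trimming removes the padding again
trimmed <- str_trim(padded)
trimmed
# [1] "ab"
```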
2.6.6 Write a function that turns (for example) the vector c("a", "b", "c") into the string "a, b, and c". Think carefully about what it should do if given a vector of length 0, 1, or 2.
str_commasep <- function(x, delim = ",") {
n <- length(x)
if (n == 0) {
""
} else if (n == 1) {
x
} else if (n == 2) {
# no comma before and when n == 2
str_c(x[[1]], "and", x[[2]], sep = " ")
} else {
# commas after all n - 1 elements
not_last <- str_c(x[seq_len(n - 1)], delim)
# prepend "and" to the last element
last <- str_c("and", x[[n]], sep = " ")
# combine parts with spaces
str_c(c(not_last, last), collapse = " ")
}
}
str_commasep("")
#> [1] ""
str_commasep("a")
#> [1] "a"
str_commasep(c("a", "b"))
#> [1] "a and b"
str_commasep(c("a", "b", "c"))
#> [1] "a, b, and c"
str_commasep(c("a", "b", "c", "d"))
#> [1] "a, b, c, and d"
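3 Pattern matching with regular expressions
3.1 str_view(), str_view_all()
str_view() and str_view_all() are a good way to learn regular expressions. Both take a character vector and a regular expression and show how they match
- str_view() shows the first match
- str_view_all() shows all matches
3.1.1 Basic matches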
x <- c("apple", "banana", "pear")
str_view(x, "an")
# The wildcard "." matches any character (except a newline):
str_view(x, ".a.")
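Like strings, regular expressions use the backslash to escape special behaviour. So to match a literal ., you need the regular expression \.. But \ is also the escape character in strings, so the regular expression \. has to be written as the string "\\.":
# Find "a.c" in the vector
str_view(c("abc", "a.c", "bef"), "a\\.c")
# "\" is the escape character in regular expressions, so matching a literal "\" requires the regular expression \\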
(x <- "a\\b")
# [1] "a\\b"
writeLines(x)
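#> a\b
# The backslash itself must be escaped again in the string, so to match the backslash in x the regex string needs four backslashes: "\\\\"
str_view(x, "\\\\")
#> a\b
# Here the second " is escaped by \, so the string never ends and the console shows the continuation prompt `+`
a1 <- "\"
a2 <- "\\"
writeLines(a2)
# \
# With an odd number of backslashes the closing " is again escaped, so the console still shows the continuation prompt `+`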
a3 <- "\\\"
#
a4 <- "\\\\"
writeLines(a4)
# \\
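3.2 Exercises
3.2.1 Explain why each of these strings doesn't match a \: "\", "\\", "\\\".
With an odd number of backslashes ("\" and "\\\"), the last backslash escapes the closing double quote, so the string is never terminated and R shows the continuation prompt "+".
"\\" is a valid string, but it contains a single backslash, which the regex engine reads as an incomplete escape. Matching one literal backslash needs the regex \\, and each of those backslashes must be escaped again in the string, giving four backslashes: "\\\\".
3.2.2 How would you match the sequence "'\ ?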
x <- "\"\'\\"
x
# [1] "\"'\\"
writeLines(x)
# "'\
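The answer above only builds the target string. A sketch of the pattern that matches it follows; the second test string is invented for illustration. The quote characters need no regex escaping, but the backslash must be escaped once for the regex and once more for the R string.

```r
library(stringr)

x <- "\"'\\"            # the three characters " ' \
pattern <- "\"'\\\\"    # regex "'\\ : the final backslash is doubled for the regex, then again for the string

matched <- str_detect(x, pattern)
matched
# [1] TRUE

# Hypothetical longer input: the sequence is found inside other text
found <- str_extract("say \"'\\ now", pattern)
writeLines(found)
# "'\
```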
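3.2.3 What patterns will the regular expression \..\..\.. match? How would you represent it as a string?
# Typing the regex directly as a string fails: y <- "\..\..\.." gives
# Error: '\.' is an unrecognized escape in character string starting ""\."
# The string form doubles every backslash; it matches a literal dot followed by any character, three times in a row
y <- "\\..\\..\\.."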
3.3 Anchors (^, $)
- ^ matches the start of the string.
- $ matches the end of the string.
x <- c("apple", "banana", "pear")
str_view(x, "^a")
str_view(x, "a$")
str_view(x, "^apple$")
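3.4 Exercises
3.4.1 How would you match the literal string "$^$"?
# Assign a test string
y <- "???$^$**&&"
# The anchors "^" and "$" are special characters and must be escaped with "\"; the backslash is itself special inside a string, so it must be escaped as well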
str_view(y,"\\$\\^\\$")
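3.4.2 Given the corpus of common words in stringr::words, create regular expressions that find all words that:
a. Start with "y".
b. End with "x".
c. Are exactly three letters long. (Don't cheat by using str_length()!)
d. Have seven letters or more.
Since this list is long, you might want to use the match argument to str_view() to show only the matching words (match = TRUE) or only the non-matching words (match = FALSE).
# Words that start with y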
str_view(words,"^y",match = T)
table(str_count(words,"^y"))
# 0 1
# 974 6
# Words that end with x
str_view(words,"x$",match = T)
table(str_count(words,"x$"))
# 0 1
# 976 4
# Words that are exactly three characters long
str_view(words,"^(...)$",match = T)
table(str_count(words,"^(...)$"))
# 0 1
# 870 110
table(str_length(words))
# 1 2 3 4 5 6 7 8 9 10 11
# 1 18 110 263 200 169 119 57 30 9 4
# Words with seven or more characters: seven consecutive dots "." inside the parentheses
str_view(words,"(.......)",match = T)
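3.5 Character classes and alternatives
# "." matches any character apart from a newline
# \d matches any digit
# \s matches any whitespace (e.g. space, tab, newline)
# [abc] matches a, b, or c
# [^abc] matches anything except a, b, or c
# "|" has low precedence, so abc|xyz matches abc or xyz
# Parentheses make the scope of "|" explicit: str_view(c("grey", "gray"), "gr(e|a)y")
To create a regular expression containing \d or \s, you need to escape the \ for the string, so you type "\\d" or "\\s".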
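The character classes above can be tried out with str_detect() and str_extract(). This is a small sketch on a made-up vector `x`, not data from the original text.

```r
library(stringr)

# Hypothetical input
x <- c("room 101", "tab\there", "no-digits")

has_digit <- str_detect(x, "\\d")      # contains a digit?
has_digit
# [1]  TRUE FALSE FALSE

has_space <- str_detect(x, "\\s")      # contains whitespace (space, tab, newline)?
has_space
# [1]  TRUE  TRUE FALSE

digits <- str_extract(x, "[0-9]+")     # first run of digits, NA when there is none
digits
# [1] "101" NA    NA
```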
3.6 Exercises
3.6.1 Create regular expressions to find all words that:
a. Start with a vowel.
b. Contain only consonants. (Hint: think about matching "not"-vowels.)
c. End with ed, but not with eed.
d. End with ing or ize.
# Words that start with a vowel
str_view(words,"^[aoeiu]",match = T)
# Words that contain only consonants: every character from start to end must be a non-vowel
str_view(words,"^[^aoeiu]+$",match = T)
# Words that end with ed, but not with eed
str_view(words,"([^e]ed)$",match = T)
str_view(words,"[^e]ed$",match = T)
# Words that end with ing or ize
str_view(words,"((ing)|(ize))$",match = T)
3.6.2 Empirically verify the rule "i before e except after c".
str_view(words,"((cei)|[^c]ie)",match = T)
3.6.3 Is "q" always followed by a "u"?
table(ifelse(str_detect(words,"qu"),"Yes","NO"))
# NO Yes
# 970 10
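Counting the words that contain "qu" does not by itself show whether a "q" ever appears without a following "u". One way to check directly, sketched here with a hypothetical helper `has_q_without_u()`, is to look for a "q" followed by anything other than "u", or by the end of the word:

```r
library(stringr)

# Hypothetical helper: TRUE when a q is followed by a non-u character or ends the string
has_q_without_u <- function(x) str_detect(x, "q([^u]|$)")

# Made-up test words to show the behaviour
check <- has_q_without_u(c("queen", "qatar", "iraq"))
check
# [1] FALSE  TRUE  TRUE

# Apply it to the stringr::words corpus; an empty result means every q is followed by u
q_exceptions <- words[has_q_without_u(words)]
q_exceptions
```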
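3.6.4 Write a regular expression that matches a word if it's probably written in British English, not American English.
3.6.5 Create a regular expression that will match telephone numbers as commonly written in your country.
# Sample values
(x <- c("+86-18217047048","1223333333","10545384333"))
# Every special character must be escaped: to match a literal "+" the regex is \+, and since \ must itself be escaped in a string, you type "\\+"; likewise "\\d"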
# str_view(x,".86-1[89]\\d\\d\\d\\d\\d\\d\\d\\d\\d")
str_view(x,"\\+86-1[89]\\d\\d\\d\\d\\d\\d\\d\\d\\d")
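3.7 Repetition
Another powerful feature of regular expressions is control over how many times a pattern matches.
- ?: 0 or 1 time.
- +: 1 or more times.
- *: 0 or more times.
- {n}: exactly n times.
- {n,}: n or more times.
- {,m}: at most m times.
- {n,m}: between n and m times.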
x <- "MDCCCCCCCLXXXCCCCLXXXVIII"
str_view(x, "CC?")
str_view(x, "CC+")
str_view_all(x, "CC+")
str_view(x, "C[LX]+")
str_view_all(x, "C[LX]+")
str_view(x, "C{2}")
str_view(x, "C{2,}")
str_view(x, "C{2,3}")
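# By default these matches are greedy: they match the longest string possible. Adding "?" after a quantifier makes it lazy, so it matches the shortest string possible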
str_view(x, 'C{2,3}?')
str_view(x, 'C[LX]+?')
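Since str_view() renders its result in a viewer, the difference between greedy and lazy matching is easier to check with str_extract(), which returns the matched text. A minimal sketch using the same string:

```r
library(stringr)

x <- "MDCCCCCCCLXXXCCCCLXXXVIII"

greedy_c <- str_extract(x, "C{2,3}")    # greedy: takes as many C's as allowed
greedy_c
# [1] "CCC"
lazy_c <- str_extract(x, "C{2,3}?")     # lazy: stops at the minimum
lazy_c
# [1] "CC"

greedy_lx <- str_extract(x, "C[LX]+")   # greedy: C followed by the whole run of L/X
greedy_lx
# [1] "CLXXX"
lazy_lx <- str_extract(x, "C[LX]+?")    # lazy: C followed by a single L or X
lazy_lx
# [1] "CL"
```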
3.8 Exercises
3.8.1 Describe the equivalents of ?, + and * in {m,n} form.
# `?`: 0 or 1 time
str_view("aaaaabbaabb","a?")
str_view("aaaaabbaabb","a{0,1}")
# `+`: 1 or more times
str_view("aaaaaabbaabb","a+")
str_view("aaaaaabbaabb","a{1,}")
# `*`: 0 or more times
str_view("aaaaaabbaabb","a*")
str_view("aaaaaabbaabb","a{0,}")
3.8.2 Describe in words what these regular expressions match (read carefully to see whether each one is a regular expression or a string that defines a regular expression):
a. ^.*$
b. "\{.+\}"
c. \d{4}-\d{2}-\d{2}
d. "\\{4}"
3.8.3 Create regular expressions to find all words that:
a. Start with three consonants.
b. Have three or more vowels in a row.
c. Have two or more vowel-consonant pairs in a row.
str_view(words,"^([^aoieu]{3})",match = T)
str_view(words,"[aoieu]{3,}",match = T)
str_view(words,"([aoieu][^aoieu]){2,}",match = T)
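3.9 Grouping and backreferences
Parentheses also define "groups", which you can refer back to with backreferences such as \1, \2 and so on
# \1 refers to whatever the group (..) matched; in a string it is written "\\1"
str_view(fruit, "(..)\\1", match = TRUE)
# Parentheses () define groups; \\n refers to the n-th group
# In (..)\\1(...)(...)\\3, \\1 repeats the text matched by the first group (..) and \\3 repeats the text matched by the third group (...)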
str_view("abcdcdxyzefgefgefghijk", "(..)\\1", match = TRUE)
str_view("abcdcdxyzefgefgefghijk", "(..)\\1(...)(...)\\3", match = TRUE)
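To see exactly which text a backreference pattern captures, str_extract() can be used in place of str_view(). The sketch below uses three made-up words rather than the full fruit vector:

```r
library(stringr)

# Hypothetical input
x <- c("banana", "coconut", "apple")

# (..)\\1 matches any two characters immediately followed by the same two characters
repeated <- str_extract(x, "(..)\\1")
repeated
# [1] "anan" "coco" NA
```

"apple" gives NA because no pair of characters in it is immediately repeated.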
3.10 Exercises
3.10.1 Describe, in words, what these expressions will match:
a. (.)\1\1
b. "(.)(.)\2\1"
c. (..)\1
d. "(.).\1.\1"
e. "(.)(.)(.).*\3\2\1"
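3.10.2 Construct regular expressions to match words that:
a. Start and end with the same character.
b. Contain a repeated pair of letters (e.g. "church" contains "ch" repeated twice).
c. Contain one letter repeated in at least three places (e.g. "eleven" contains three "e"s).
# The (\\1?$) branch handles words that consist of a single character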
str_view(words,"^([A-Za-z])((.*(\\1$))|(\\1?$))", match = TRUE)
str_view(words,"([A-Za-z][A-Za-z])(.*)\\1", match = TRUE)
str_view(words,"([A-Za-z])(.*)\\1(.*)\\1", match = TRUE)
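4 Tools
With regular expressions, a wide range of stringr functions let you:
- determine which strings match a pattern;
- find the positions of matches;
- extract the content of matches;
- replace matches with new values;
- split a string based on a match.
4.1 Detect matches
- str_detect()
To determine whether a character vector matches a pattern, use str_detect(). It returns a logical vector the same length as the input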
x <- c("apple", "banana", "pear")
str_detect(x, "e")
#> [1] TRUE FALSE TRUE
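# In a numeric context, FALSE becomes 0 and TRUE becomes 1. That makes sum() and mean() useful for summarising matches across a large vector
# How many common words start with t?
sum(str_detect(words, "^t"))
#> [1] 65
# What proportion of common words end with a vowel?
mean(str_detect(words, "[aeiou]$"))
#> [1] 0.277
# With complex logical conditions, it is often easier to combine several str_detect() calls with logical operators
# Find all words containing at least one vowel, and negate
no_vowels_1 <- !str_detect(words, "[aeiou]")
# Find all words consisting only of consonants (non-vowels)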
no_vowels_2 <- str_detect(words, "^[^aeiou]+$")
identical(no_vowels_1, no_vowels_2)
#> [1] TRUE
- str_subset()
# A common use of str_detect() is to select the elements that match a pattern. You can do this with logical subsetting, or with the convenient str_subset() wrapper
# x[...] selects elements of a vector
words[str_detect(words, "x$")]
#> [1] "box" "sex" "six" "tax"
str_subset(words, "x$")
#> [1] "box" "sex" "six" "tax"
# Strings are usually a column of a data frame, so you can use filter() instead
df <- tibble(
word = words,
i = seq_along(word)
)
df %>%
filter(str_detect(word, "x$"))
# # A tibble: 4 x 2
# word i
# <chr> <int>
# 1 box 108
# 2 sex 747
# 3 six 772
# 4 tax 841
- str_count()
A variation on str_detect(): rather than a simple yes or no, it tells you how many matches there are in each string:
x <- c("apple", "banana", "pear")
str_count(x, "a")
#> [1] 1 3 1
# On average, how many vowels per word?
mean(str_count(words, "[aeiou]"))
#> [1] 1.99
It's natural to use str_count() with mutate():
df %>%
mutate(
vowels = str_count(word, "[aeiou]"),
consonants = str_count(word, "[^aeiou]")
)
Many stringr functions come in pairs: one function works with a single match, and the other works with all matches. The second function has the suffix _all.
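One detail worth knowing when counting or extracting all matches is that matches never overlap. A short sketch with a made-up string shows this: "aba" could start at positions 1, 3 and 5 of "abababa", but only two non-overlapping matches are reported.

```r
library(stringr)

n_matches <- str_count("abababa", "aba")
n_matches
# [1] 2

all_matches <- str_extract_all("abababa", "aba")[[1]]
all_matches
# [1] "aba" "aba"
```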
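4.2 Exercises
For each of the following challenges, try solving it using both a single regular expression and a combination of multiple str_detect() calls.
a. Find all words that start or end with x.
b. Find all words that start with a vowel and end with a consonant.
c. Are there any words that contain at least one of each different vowel?
d. What word has the highest number of vowels? What word has the highest proportion of vowels? (Hint: what is the denominator?)
# Start or end with x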
words[str_detect(words,"(^x(.*))|((.*)x$)")]
str_subset(words,"(^x(.*))|((.*)x$)")
# Start with a vowel and end with a consonant
words[str_detect(words,"^[aoiue](.*)[^aoiue]$")]
words[(str_detect(words,"^[aoiue](.*)"))&(str_detect(words,"(.*)[^aoiue]$"))]
str_subset(words,"^[aoiue](.*)[^aoiue]$")
# Words that contain every vowel
words[str_detect(words, "a") &
str_detect(words, "e") &
str_detect(words, "i") &
str_detect(words, "o") &
str_detect(words, "u")
]
# character(0)
# Words with the highest number of vowels
words[
which(
str_count(words, "[aeiou]") == max(str_count(words, "[aeiou]"))
)
]
# words[str_count(words, "[aeiou]") == max(str_count(words, "[aeiou]"))]
# [1] "appropriate" "associate" "available" "colleague" "encourage"
# [6] "experience" "individual" "television"
4.3 Extract matches
To extract the actual text of a match, use str_extract().
stringr::sentences
length(sentences)
# [1] 720
head(sentences)
# [1] "The birch canoe slid on the smooth planks."
# [2] "Glue the sheet to the dark blue background."
# [3] "It's easy to tell the depth of a well."
# [4] "These days a chicken leg is a rare dish."
# [5] "Rice is often served in round bowls."
# [6] "The juice of lemons makes fine punch."
# Create a vector of colour names, then turn it into a single regular expression:
colors <- c("red", "orange", "yellow", "green", "blue", "purple")
color_match <- str_c(colors, collapse = "|")
color_match
# [1] "red|orange|yellow|green|blue|purple"
# Select the sentences that contain a colour, then extract the colour
has_color <- str_subset(sentences, color_match)
matches <- str_extract(has_color, color_match)
head(matches)
#> [1] "blue" "blue" "red" "red" "red" "blue"
more <- sentences[str_count(sentences,color_match)>1]
str_view_all(more,color_match)
str_extract(more,color_match)
# [1] "blue" "green" "orange"
str_extract_all(more,color_match)
# [[1]]
# [1] "blue" "red"
#
# [[2]]
# [1] "green" "red"
#
# [[3]]
# [1] "orange" "red"
# With simplify = TRUE, str_extract_all() returns a matrix in which short matches are padded to the same length as the longest:
str_extract_all(more, color_match, simplify = TRUE)
#> [,1] [,2]
#> [1,] "blue" "red"
#> [2,] "green" "red"
#> [3,] "orange" "red"
x <- c("a", "a b", "a b c")
str_extract_all(x, "[a-z]", simplify = TRUE)
#> [,1] [,2] [,3]
#> [1,] "a" "" ""
#> [2,] "a" "b" ""
#> [3,] "a" "b" "c"
4.4 Exercises
4.4.1 In the previous example, you might have noticed that the regular expression matched "flickered", which is not a colour. Modify the regex to fix the problem.
colors <- c(
"red", "orange", "yellow", "green", "blue", "purple"
)
color_match <- str_c(colors, collapse = "|")
color_match
# [1] "red|orange|yellow|green|blue|purple"
# Fix
## \b matches a word boundary, so wrapping the alternatives in \b( ... )\b matches whole words only
color_match2 <- str_c("\\b(", str_c(colors, collapse = "|"), ")\\b")
color_match2
# [1] "\\b(red|orange|yellow|green|blue|purple)\\b"
more2 <- sentences[str_count(sentences, color_match2) > 1]
str_view_all(more2, color_match2, match = TRUE)
4.4.2 From the Harvard sentences data, extract:
a. The first word from each sentence.
b. All words ending in ing.
c. All plurals.
# The first word from each sentence
str_extract(sentences, "[A-Za-z]+") %>%
head()
# [1] "The" "Glue" "It" "These" "Rice" "The"
# All words ending in ing
pattern <- "\\b[A-Za-z]+ing\\b"
sentences_with_ing <- str_detect(sentences, pattern)
unique(unlist(str_extract_all(sentences[sentences_with_ing], pattern)))
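4.5 Grouped matches
Parentheses clarify precedence in a regular expression and define groups. Groups can be referred to with backreferences while matching, and they also let you extract the individual parts of a complex match.
# Find all words that come after "a" or "the"
# Defining a "word" in a regular expression is a little tricky, so we use a simple approximation: a sequence of at least one character that isn't a space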
noun <- "(a|the) ([^ ]+)"
has_noun <- sentences %>%
str_subset(noun) %>%
head(10)
has_noun %>%
str_extract(noun)
#> [1] "the smooth" "the sheet" "the depth" "a chicken"
#> [5] "the parked" "the sun" "the huge" "the ball"
#> [9] "the woman" "a helps"
- str_extract() gives the complete match;
- str_match() gives each individual group. Instead of a character vector, it returns a matrix, with one column for the complete match followed by one column for each group:
str_match(has_noun, noun)
# or has_noun %>% str_match(noun)
# [,1] [,2] [,3]
# [1,] "the smooth" "the" "smooth"
# [2,] "the sheet" "the" "sheet"
# [3,] "the depth" "the" "depth"
# [4,] "a chicken" "a" "chicken"
# [5,] "the parked" "the" "parked"
# [6,] "the sun" "the" "sun"
# [7,] "the huge" "the" "huge"
# [8,] "the ball" "the" "ball"
# [9,] "the woman" "the" "woman"
# [10,] "a helps" "a" "helps"
- If your data is in a tibble, it is often easier to use tidyr::extract(). It works like str_match() but requires you to name the groups, which are then placed in new columns of the tibble:
tibble(sentence = sentences) %>%
tidyr::extract(
sentence, c("article", "noun"), "(a|the) ([^ ]+)",
remove = FALSE
)
#> # A tibble: 720 × 3
#> sentence article noun
#> * <chr> <chr> <chr>
#> 1 The birch canoe slid on the smooth planks. the smooth
#> 2 Glue the sheet to the dark blue background. the sheet
#> 3 It's easy to tell the depth of a well. the depth
#> 4 These days a chicken leg is a rare dish. a chicken
#> 5 Rice is often served in round bowls. <NA> <NA>
#> 6 The juice of lemons makes fine punch. <NA> <NA>
#> # ... with 714 more rows
As with str_extract(), if you want all matches for each string, you need str_match_all().
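As a minimal sketch of str_match_all(), the same `noun` pattern can be applied to a single sentence from the data set. The result is a list with one matrix per input string, and each row of a matrix is one match:

```r
library(stringr)

noun <- "(a|the) ([^ ]+)"
s <- "It's easy to tell the depth of a well."

# One matrix per input string; take the first (and only) element
m <- str_match_all(s, noun)[[1]]
m
#      [,1]        [,2]  [,3]
# [1,] "the depth" "the" "depth"
# [2,] "a well."   "a"   "well."
```

Note that the simple "non-space characters" definition of a word keeps the trailing full stop in "well.".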
4.6 Exercises
4.6.1 Find all words that come after a "number" like "one", "two", "three" etc. Pull out both the number and the word.
# \b: word boundary
# \w: any word character
# \W: any non-word character
numword <- "\\b(one|two|three|four|five|six|seven|eight|nine|ten) +(\\w+)"
sentences[str_detect(sentences, numword)] %>% str_extract(numword)
4.6.2 Find all contractions. Separate out the pieces before and after the apostrophe.
contraction <- "([A-Za-z]+)'([A-Za-z]+)"
sentences[str_detect(sentences, contraction)] %>%
str_extract(contraction) %>%
str_split("'")
4.7 Replace matches
str_replace() and str_replace_all() replace matches with new strings. The simplest use is to replace a pattern with a fixed string:
x <- c("apple", "pear", "banana")
str_replace(x, "[aeiou]", "-")
#> [1] "-pple" "p-ar" "b-nana"
str_replace_all(x, "[aeiou]", "-")
#> [1] "-ppl-" "p--r" "b-n-n-"
With str_replace_all() you can perform multiple replacements at once by supplying a named vector:
x <- c("1 house", "2 cars", "3 people")
str_replace_all(x, c("1" = "one", "2" = "two", "3" = "three"))
#> [1] "one house" "two cars" "three people"
Instead of replacing with a fixed string, you can use backreferences to insert components of the match. The following code flips the order of the second and third words:
sentences %>%
str_replace("([^ ]+) ([^ ]+) ([^ ]+)", "\\1 \\3 \\2") %>%
head(5)
#> [1] "The canoe birch slid on the smooth planks."
#> [2] "Glue sheet the to the dark blue background."
#> [3] "It's to easy tell the depth of a well."
#> [4] "These a days chicken leg is a rare dish."
#> [5] "Rice often is served in round bowls."
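The same backreference technique is handy for reordering structured text. As a sketch on made-up dates (not part of the original data), the three captured groups of an ISO date can be rearranged into day/month/year order:

```r
library(stringr)

# Hypothetical input
dates <- c("2024-01-31", "1999-12-05")

# \\1 = year, \\2 = month, \\3 = day
flipped <- str_replace(dates, "(\\d{4})-(\\d{2})-(\\d{2})", "\\3/\\2/\\1")
flipped
# [1] "31/01/2024" "05/12/1999"
```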
4.8 Exercises
4.8.1 Replace all forward slashes in a string with backslashes.
x <- c("a/b","a/b/c")
(str_replace_all(x,"/","\\\\"))
# [1] "a\\b" "a\\b\\c"
writeLines(str_replace_all(x,"/","\\\\"))
# a\b
# a\b\c
4.8.2 Implement a simple version of str_to_lower() using str_replace_all().
LETTERS2letters <- letters
names(LETTERS2letters) <- LETTERS
str_replace_all(words, LETTERS2letters)
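4.8.3 Switch the first and last letters in words. Which of those strings are still words?
4.9 Splitting
str_split() splits a string into multiple pieces.
# Split sentences into words
# Each component may contain a different number of pieces, so str_split() returns a list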
y <- str_split(head(sentences,5)," ")
# [[1]]
# [1] "The" "birch" "canoe" "slid" "on" "the" "smooth"
# [8] "planks."
#
# [[2]]
# [1] "Glue" "the" "sheet" "to" "the"
# [6] "dark" "blue" "background."
#
# [[3]]
# [1] "It's" "easy" "to" "tell" "the" "depth" "of" "a" "well."
#
# [[4]]
# [1] "These" "days" "a" "chicken" "leg" "is" "a"
# [8] "rare" "dish."
#
# [[5]]
# [1] "Rice" "is" "often" "served" "in" "round" "bowls."
y[[1]][1]
# [1] "The"
# If you're working with a length-1 vector, the easiest thing is to extract the first element of the list:
"a|b|c|d" %>%
str_split("\\|") %>%
.[[1]]
#> [1] "a" "b" "c" "d"
# Setting simplify = TRUE returns a matrix instead
str_split(head(sentences,5)," ", simplify = TRUE)
# [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
# [1,] "The" "birch" "canoe" "slid" "on" "the" "smooth" "planks." ""
# [2,] "Glue" "the" "sheet" "to" "the" "dark" "blue" "background." ""
# [3,] "It's" "easy" "to" "tell" "the" "depth" "of" "a" "well."
# [4,] "These" "days" "a" "chicken" "leg" "is" "a" "rare" "dish."
# [5,] "Rice" "is" "often" "served" "in" "round" "bowls." "" ""
# You can also request a maximum number of pieces
str_split(head(sentences,5)," ", n=2,simplify = TRUE)
# [,1] [,2]
# [1,] "The" "birch canoe slid on the smooth planks."
# [2,] "Glue" "the sheet to the dark blue background."
# [3,] "It's" "easy to tell the depth of a well."
# [4,] "These" "days a chicken leg is a rare dish."
# [5,] "Rice" "is often served in round bowls."
# Instead of a pattern, you can also split by character, line, sentence and word boundaries with boundary()
# boundary(type = c("character", "line_break", "sentence", "word"), skip_word_none = NA, ...)
str_split(head(sentences,5),boundary("word"))
# [[1]]
# [1] "The" "birch" "canoe" "slid" "on" "the" "smooth" "planks"
#
# [[2]]
# [1] "Glue" "the" "sheet" "to" "the" "dark" "blue"
# [8] "background"
#
# [[3]]
# [1] "It's" "easy" "to" "tell" "the" "depth" "of" "a" "well"
#
# [[4]]
# [1] "These" "days" "a" "chicken" "leg" "is" "a" "rare" "dish"
#
# [[5]]
# [1] "Rice" "is" "often" "served" "in" "round" "bowls"
str_split(head(sentences,5),boundary("word"),simplify = TRUE)
# [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
# [1,] "The" "birch" "canoe" "slid" "on" "the" "smooth" "planks" ""
# [2,] "Glue" "the" "sheet" "to" "the" "dark" "blue" "background" ""
# [3,] "It's" "easy" "to" "tell" "the" "depth" "of" "a" "well"
# [4,] "These" "days" "a" "chicken" "leg" "is" "a" "rare" "dish"
# [5,] "Rice" "is" "often" "served" "in" "round" "bowls" "" ""
str_split(head(sentences,5),boundary("word"),n=2,simplify = TRUE)
# [,1] [,2]
# [1,] "The" "birch canoe slid on the smooth planks."
# [2,] "Glue" "the sheet to the dark blue background."
# [3,] "It's" "easy to tell the depth of a well."
# [4,] "These" "days a chicken leg is a rare dish."
# [5,] "Rice" "is often served in round bowls."
sentences %>%
head(2) %>%
str_split(boundary("character"))
# [[1]]
# [1] "T" "h" "e" " " "b" "i" "r" "c" "h" " " "c" "a" "n" "o" "e" " " "s" "l" "i" "d" " " "o" "n" " " "t" "h" "e" " " "s" "m" "o" "o" "t"
# [34] "h" " " "p" "l" "a" "n" "k" "s" "."
#
# [[2]]
# [1] "G" "l" "u" "e" " " "t" "h" "e" " " "s" "h" "e" "e" "t" " " "t" "o" " " "t" "h" "e" " " "d" "a" "r" "k" " " "b" "l" "u" "e" " " "b"
# [34] "a" "c" "k" "g" "r" "o" "u" "n" "d" "."
sentences %>%
head(5) %>%
str_split(boundary("line_break"))
# [[1]]
# [1] "The " "birch " "canoe " "slid " "on " "the " "smooth " "planks."
#
# [[2]]
# [1] "Glue " "the " "sheet " "to " "the " "dark " "blue " "background."
#
# [[3]]
# [1] "It's " "easy " "to " "tell " "the " "depth " "of " "a " "well."
#
# [[4]]
# [1] "These " "days " "a " "chicken " "leg " "is " "a " "rare " "dish."
#
# [[5]]
# [1] "Rice " "is " "often " "served " "in " "round " "bowls."
sentences %>%
head(5) %>%
str_split(boundary("sentence"))
# [[1]]
# [1] "The birch canoe slid on the smooth planks."
#
# [[2]]
# [1] "Glue the sheet to the dark blue background."
#
# [[3]]
# [1] "It's easy to tell the depth of a well."
#
# [[4]]
# [1] "These days a chicken leg is a rare dish."
#
# [[5]]
# [1] "Rice is often served in round bowls."
4.10 Exercises
4.10.1 Split up a string like "apples, pears, and bananas" into individual components.
"apples, pears, and bananas" %>%
str_split(boundary("character"))
# [[1]]
# [1] "a" "p" "p" "l" "e" "s" "," " " "p" "e" "a" "r" "s" "," " " "a" "n" "d" " " "b" "a" "n" "a" "n" "a" "s"
"apples, pears, and bananas" %>%
str_split(boundary("word"))
# [[1]]
# [1] "apples" "pears" "and" "bananas"
"apples, pears, and bananas" %>%
str_split(boundary("sentence"))
# [[1]]
# [1] "apples, pears, and bananas"
4.10.2 Why is it better to split up by boundary("word") than " "?
# Splitting on a space leaves the punctuation attached to the words
"apples, pears, and bananas" %>%
str_split(" ")
# [[1]]
# [1] "apples," "pears," "and" "bananas"
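4.10.3 What does splitting with an empty string ("") do? Experiment, and then read the documentation.
# Splitting with an empty string ("") splits the string into individual characters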
"apples, pears, and bananas" %>%
str_split("")
# [[1]]
# [1] "a" "p" "p" "l" "e" "s" "," " " "p" "e" "a" "r" "s" "," " " "a" "n" "d" " " "b" "a" "n" "a" "n" "a" "s"
4.11 Find matches
str_locate() and str_locate_all() give the start and end positions of each match.
You can use str_locate() to find the matching pattern, then str_sub() to extract or modify the match.
head(sentences,5)
# [1] "The birch canoe slid on the smooth planks." "Glue the sheet to the dark blue background."
# [3] "It's easy to tell the depth of a well." "These days a chicken leg is a rare dish."
# [5] "Rice is often served in round bowls."
str_locate(head(sentences,5),"days")
# start end
# [1,] NA NA
# [2,] NA NA
# [3,] NA NA
# [4,] 7 10
# [5,] NA NA
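The combination of str_locate() and str_sub() described above can be sketched as follows, using the fourth sentence from the output. The replacement text "DAYS" is an arbitrary choice for illustration.

```r
library(stringr)

s <- "These days a chicken leg is a rare dish."

# Locate the first match: a one-row matrix with start and end columns
loc <- str_locate(s, "days")
loc
#      start end
# [1,]     7  10

# Extract the matched text by position
piece <- str_sub(s, loc[, "start"], loc[, "end"])
piece
# [1] "days"

# Modify the matched text with the assignment form of str_sub()
str_sub(s, loc[, "start"], loc[, "end"]) <- "DAYS"
s
# [1] "These DAYS a chicken leg is a rare dish."
```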
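5 Other types of pattern
When you use a string as a pattern, it is automatically wrapped in a call to regex():
The regular call:
str_view(fruit, "nana")
is shorthand for:
str_view(fruit, regex("nana"))
You can use the other arguments of regex() to control details of the match.
- ignore_case = TRUE allows characters to match either their uppercase or lowercase forms. This always uses the current locale: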
bananas <- c("banana", "Banana", "BANANA")
str_view(bananas, "banana")
str_view(bananas, regex("banana", ignore_case = TRUE))
- multiline = TRUE allows ^ and $ to match the start and end of each line rather than the start and end of the complete string:
x <- "Line 1\nLine 2\nLine 3"
str_extract_all(x, "^Line")[[1]]
#> [1] "Line"
str_extract_all(x, regex("^Line", multiline = TRUE))[[1]]
#> [1] "Line" "Line" "Line"
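- comments = TRUE allows you to add comments and whitespace to make a complex regular expression easier to understand. Spaces are ignored, as is everything after #. To match a literal space, you need to escape it: "\\ ":
phone <- regex("
\\(? # optional opening parenthesis
(\\d{3}) # area code
[) -]? # optional closing parenthesis, dash, or space
(\\d{3}) # another three digits
[ -]? # optional space or dash
(\\d{3}) # three more digits
", comments = TRUE)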
str_match("514-791-8141", phone)
#> [,1] [,2] [,3] [,4]
#> [1,] "514-791-814" "514" "791" "814"
- dotall = TRUE allows . to match everything, including \n.
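The effect of dotall is easiest to see on a string that contains a newline. In this minimal sketch (the two-line string is made up), the plain pattern fails because . will not cross the line break, while the dotall version matches:

```r
library(stringr)

x <- "first\nsecond"

# By default . does not match \n, so there is no match
plain <- str_extract(x, "first.second")
plain
# [1] NA

# With dotall = TRUE, . also matches the newline
with_dotall <- str_extract(x, regex("first.second", dotall = TRUE))
with_dotall
# [1] "first\nsecond"
```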
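5.1 Besides regex(), there are three other functions you can use.
- fixed() matches exactly the specified sequence of bytes. It ignores all special regular-expression characters and operates at a very low level. This lets you avoid complex escaping and can be much faster than regular expressions. The following microbenchmark shows that it is about 3x faster in this simple example: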
microbenchmark::microbenchmark(
fixed = str_detect(sentences, fixed("the")),
regex = str_detect(sentences, "the"),
times = 20
)
#> Unit: microseconds
#> expr min lq mean median uq max neval cld
#> fixed 116 117 136 120 125 389 20 a
#> regex 333 337 346 338 342 467 20 b
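Apart from speed, fixed() removes the need for escaping. A small sketch on a made-up vector compares the three ways of writing the pattern: as a regex the dot is a wildcard, while fixed() and the escaped regex both treat it literally.

```r
library(stringr)

# Hypothetical input
x <- c("a.c", "abc")

as_regex <- str_detect(x, "a.c")          # . matches any character
as_regex
# [1] TRUE TRUE

as_fixed <- str_detect(x, fixed("a.c"))   # . is a literal dot
as_fixed
# [1]  TRUE FALSE

escaped <- str_detect(x, "a\\.c")         # same result, but requires escaping
escaped
# [1]  TRUE FALSE
```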
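Beware of using fixed() with non-English data. It is problematic because there are often multiple ways of representing the same character. For example, there are two ways to define á: either as a single character, or as an a plus an accent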
a1 <- "\u00e1"
a2 <- "a\u0301"
c(a1, a2)
#> [1] "á" "á"
a1 == a2
#> [1] FALSE
They render identically, but because they are defined differently, fixed() doesn't find a match. Instead, you can use coll(), described next, to respect human character-comparison rules:
str_detect(a1, fixed(a2))
#> [1] FALSE
str_detect(a1, coll(a2))
#> [1] TRUE
- coll() compares strings using standard collation rules. This is useful for case-insensitive matching. Note that coll() takes a locale argument that controls which rules are used for comparing characters. Unfortunately, different parts of the world use different rules!
# This means you still need to be aware of the differences between rules when doing case-insensitive matches:
i <- c("I", "İ", "i", "ı")
i
#> [1] "I" "İ" "i" "ı"
str_subset(i, coll("i", ignore_case = TRUE))
#> [1] "I" "i"
str_subset(
i,
coll("i", ignore_case = TRUE, locale = "tr")
)
#> [1] "İ" "i"
Both fixed() and regex() have an ignore_case argument, but they do not let you pick the locale: they always use the default locale. You can see what that is with the following code (more on stringi later):
stringi::stri_locale_info()
#> $Language
#> [1] "en"
#>
#> $Country
#> [1] "US"
#>
#> $Variant
#> [1] ""
#>
#> $Name
#> [1] "en_US"
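The downside of coll() is speed: because the rules for recognising which characters are the same are complicated, coll() is relatively slow compared with regex() and fixed().
As you saw with str_split(), you can use boundary() to match boundaries. You can also use it with the other functions: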
x <- "This is a sentence."
str_view_all(x, boundary("word"))
str_extract_all(x, boundary("word"))
#> [[1]]
#> [1] "This" "is" "a" "sentence"
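6 Other uses of regular expressions
There are two useful functions in base R that also use regular expressions.
- apropos() searches all objects available from the global environment. This is useful if you can't quite remember the name of a function: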
apropos("replace")
#> [1] "%+replace%" "replace" "replace_na"
#> [4] "str_replace" "str_replace_all" "str_replace_na"
#> [7] "theme_replace"
- dir() lists all the files in a directory. Its pattern argument takes a regular expression and returns only the file names that match the pattern. For example, you can find all the R Markdown files in the current directory with:
head(dir(pattern = "\\.Rmd$"))
#> [1] "communicate-plots.Rmd" "communicate.Rmd"
#> [3] "datetimes.Rmd" "EDA.Rmd"
#> [5] "explore.Rmd" "factors.Rmd"
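7 stringi
stringr is built on top of stringi. stringr is easy to learn because it exposes a minimal set of functions, carefully chosen to handle the most common string-manipulation tasks. stringi, on the other hand, is designed to be comprehensive and contains almost every function you might ever need: stringi has 234 functions to stringr's 42.
If you find yourself struggling to do something in stringr, it's worth taking a look at stringi. The two packages work very similarly, so you should be able to translate your stringr knowledge in a natural way. The main difference is the prefix: str_ versus stri_.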