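1) Common metacharacters
Code | Meaning |
---|---|
. | Matches any character except a newline |
\w | Matches a letter, digit, or underscore |
\s | Matches any whitespace character |
\d | Matches a digit |
\b | Matches the start or end of a word (anchor) |
^ | Matches the start of the string (anchor) |
$ | Matches the end of the string (anchor) |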
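As a quick check of the table above, the sketch below applies several of the metacharacters to a made-up string (the string `s` is an assumption chosen only for illustration). Note that inside an R string literal the backslash must itself be escaped, so `\d` is written as `"\\d"`.

```r
library(stringr)

# Hypothetical sample string, chosen only for illustration
s <- "id_7 ok"

str_extract_all(s, "\\d")[[1]]   # digits: "7"
str_extract_all(s, "\\s")[[1]]   # whitespace: " "
str_extract_all(s, "\\w")[[1]]   # letters, digits, underscore: "i" "d" "_" "7" "o" "k"
str_extract(s, "^.")             # first character of the string: "i"
str_extract(s, ".$")             # last character of the string: "k"
```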
2) \b matches the start or end of a word
library(stringr)
str_extract_all("a test of capitalizing", "\\b(\\w)")
# [[1]]
# [1] "a" "t" "o" "c"
str_extract_all("a test of capitalizing", "(\\w)\\b")
# [[1]]
# [1] "a" "t" "f" "g"
txt1 <- "capitalizing"
str_extract_all(txt1, "\\b[a-z]")
# [[1]]
# [1] "c"
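One common use of the `\\b(\\w)` pattern is capitalizing the first letter of every word. The sketch below uses base R `gsub()` rather than stringr, because the `\\U` case-conversion escape in the replacement is only available with `perl = TRUE`; the variable name `sentence` is introduced here for illustration.

```r
sentence <- "a test of capitalizing"

# \\b(\\w) captures the first character of each word;
# \\U\\1 (only valid with perl = TRUE) writes the captured character back in upper case
capitalized <- gsub("\\b(\\w)", "\\U\\1", sentence, perl = TRUE)
print(capitalized)
# "A Test Of Capitalizing"
```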
3) \w matches word characters (letters, digits, underscore)
txt <- "a test of capitalizing"  # same sentence as in section 2
str_extract_all(txt, "\\w")
# [[1]]
# [1] "a" "t" "e" "s" "t" "o" "f" "c" "a" "p" "i" "t" "a" "l" "i" "z" "i" "n" "g"
str_extract_all(txt, "(\\w)(\\w)(\\w)")
# [[1]]
# [1] "tes" "cap" "ita" "liz" "ing"
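The two calls above match one character and exactly three characters at a time. As a small complementary sketch, adding the `+` quantifier matches whole words, and `str_match()` exposes the individual capture groups of the first match. The sample sentence is repeated here so the block runs on its own.

```r
library(stringr)

txt <- "a test of capitalizing"

# \\w+ matches one or more consecutive word characters, i.e. whole words
words <- str_extract_all(txt, "\\w+")[[1]]
print(words)
# "a" "test" "of" "capitalizing"

# str_match() returns a matrix: the full first match, then one column per group
groups <- str_match(txt, "(\\w)(\\w)(\\w)")
print(groups)
# first row holds "tes" followed by the groups "t", "e", "s"
```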
4) Package stringr
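(Updated 2022-09-10)
Function | Description | Base R equivalent |
---|---|---|
Functions that use regular expressions | ||
str_extract() | Extracts the first match of the pattern | regmatches() |
str_extract_all() | Extracts all matches of the pattern | regmatches() |
str_locate() | Returns the position of the first match | regexpr() |
str_locate_all() | Returns the positions of all matches | gregexpr() |
str_replace() | Replaces the first match | sub() |
str_replace_all() | Replaces all matches | gsub() |
str_split() | Splits a string by the pattern | strsplit() |
str_split_fixed() | Splits a string by the pattern into a fixed number of pieces | - |
str_detect() | Tests whether a string contains the pattern | grepl() |
str_count() | Counts the occurrences of the pattern | - |
Other important functions | ||
str_sub() | Extracts the characters at the given positions | substr() |
str_dup() | Repeats a string a given number of times | strrep() |
str_length() | Returns the number of characters in a string | nchar() |
str_pad() | Pads a string to a given width | - |
str_trim() | Removes padding, such as leading and trailing whitespace | trimws() |
str_c() | Joins strings | paste(), paste0() |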
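To make the correspondence in the table concrete, here is a minimal sketch that runs several stringr functions next to their base R counterparts. The input vector `x` is a hypothetical example, not data from this post.

```r
library(stringr)

x <- c("  apple pie ", "banana")   # hypothetical input

str_detect(x, "an")            # same as grepl("an", x): FALSE TRUE
str_count(x, "p")              # occurrences of "p" per element: 3 0
str_replace_all(x, "a", "A")   # same as gsub("a", "A", x)
str_trim(x)                    # same as trimws(x)
str_sub("banana", 1, 3)        # same as substr("banana", 1, 3): "ban"
str_dup("ab", 3)               # same as strrep("ab", 3): "ababab"
str_pad("7", width = 3, side = "left", pad = "0")   # "007"
str_c("a", "b", sep = "-")     # same as paste("a", "b", sep = "-"): "a-b"
```

The stringr functions take the string as the first argument and the pattern second, which is the reverse of `grepl()`, `sub()`, and `gsub()`.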
5) Case 1: Splitting data
Extract the text inside the square brackets and the text before them into separate columns.
head(Bracket)
#[1] "pyrroline-5-carboxylate reductase [EC:1.5.1.2]" "proline dehydrogenase [EC:1.5.-.-]"
#[3] "tyrosine-protein kinase Etk/Wzc [EC:2.7.10.-]" "glycosyltransferase EpsD [EC:2.4.-.-]"
#[5] "glycosyltransferase EpsE [EC:2.4.-.-]" "glycosyltransferase EpsF [EC:2.4.-.-]"
# Part inside the square brackets, brackets included
hou1 = str_extract(Bracket, "\\[.*\\]"); head(hou1)
# [1] "[EC:1.5.1.2]" "[EC:1.5.-.-]" "[EC:2.7.10.-]" "[EC:2.4.-.-]" "[EC:2.4.-.-]" "[EC:2.4.-.-]"
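# Part before the bracket; the lookahead (?=\\[) stops just before "[", so the trailing space is kept
qian = str_extract(Bracket, "(?<=^)(.*)(?=\\[)"); head(qian)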
# [1] "pyrroline-5-carboxylate reductase " "proline dehydrogenase "
# [3] "tyrosine-protein kinase Etk/Wzc " "glycosyltransferase EpsD "
# [5] "glycosyltransferase EpsE " "glycosyltransferase EpsF "
data = data.frame(annotation = qian, enzyme = hou1); head(data)
# annotation enzyme
# 1 pyrroline-5-carboxylate reductase [EC:1.5.1.2]
# 2 proline dehydrogenase [EC:1.5.-.-]
# 3 tyrosine-protein kinase Etk/Wzc [EC:2.7.10.-]
# 4 glycosyltransferase EpsD [EC:2.4.-.-]
# 5 glycosyltransferase EpsE [EC:2.4.-.-]
# 6 glycosyltransferase EpsF [EC:2.4.-.-]
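As an alternative sketch, both parts can be captured in a single pass with `str_match()` and two capture groups. The vector `bracket_sample` below holds just two entries copied from the `head(Bracket)` output, and the names `bracket_sample`, `parts`, and `data_sample` are introduced here for illustration. Unlike the approach above, this version drops the trailing space from the annotation and the square brackets from the enzyme code.

```r
library(stringr)

# Two entries copied from the head(Bracket) output above
bracket_sample <- c("pyrroline-5-carboxylate reductase [EC:1.5.1.2]",
                    "proline dehydrogenase [EC:1.5.-.-]")

# Group 1: text before the bracket (lazy, so the trailing space is left out)
# Group 2: text inside the brackets
parts <- str_match(bracket_sample, "^(.*?)\\s*\\[(.*)\\]$")

# Column 1 of the matrix is the full match; columns 2 and 3 are the groups
data_sample <- data.frame(annotation = parts[, 2], enzyme = parts[, 3])
print(data_sample)
```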
6) Case 2: Match and replace
(Updated 2022-09-10)
In the "Trophic.Mode" column of a data frame, any string that contains "Saprotroph" is renamed to "Saprotroph"; all other values stay unchanged.
# Replace using grepl() + ifelse()
> head(FunGuild0903$Trophic.Mode, 15)
[1] "Symbiotroph" "Symbiotroph" "Symbiotroph" "Symbiotroph"
[5] "Symbiotroph" "Symbiotroph" "Symbiotroph" "Symbiotroph"
[9] "Symbiotroph" "Saprotroph-Symbiotroph" "Saprotroph-Symbiotroph" "Saprotroph-Symbiotroph"
[13] "Saprotroph-Symbiotroph" "Saprotroph-Symbiotroph" "Saprotroph"
> for (i in 1:361) {
+ FunGuild0903$Trophic.Mode[i] <- ifelse(grepl("Saprotroph", FunGuild0903$Trophic.Mode[i]),
+ "Saprotroph",
+ FunGuild0903$Trophic.Mode[i])
+ }
> head(FunGuild0903$Trophic.Mode, 15)
[1] "Symbiotroph" "Symbiotroph" "Symbiotroph" "Symbiotroph" "Symbiotroph" "Symbiotroph" "Symbiotroph"
[8] "Symbiotroph" "Symbiotroph" "Saprotroph" "Saprotroph" "Saprotroph" "Saprotroph" "Saprotroph"
[15] "Saprotroph"
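Because `grepl()` and `ifelse()` are both vectorized, the same replacement can likely be written without an index loop or a hard-coded row count. The sketch below uses a short hypothetical vector, `trophic`, as a stand-in for `FunGuild0903$Trophic.Mode`.

```r
# Hypothetical stand-in for FunGuild0903$Trophic.Mode
trophic <- c("Symbiotroph", "Saprotroph-Symbiotroph", "Saprotroph")

# grepl() tests every element at once; ifelse() picks the replacement element by element
trophic <- ifelse(grepl("Saprotroph", trophic), "Saprotroph", trophic)
print(trophic)
# "Symbiotroph" "Saprotroph" "Saprotroph"
```

Applied to the real data frame, the same line would assign back to the column instead of to `trophic`.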
The main arguments of grepl() are:
grepl(pattern, x, ignore.case = FALSE, perl = FALSE,
fixed = FALSE, useBytes = FALSE)
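The official R documentation explains: "grepl returns a logical vector (match or not for each element of x)". In other words, grepl returns one logical value per element of x, indicating whether "pattern" matched that element.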
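A small sketch of how two of these arguments change the result; the vector `x` is a hypothetical example. `ignore.case = TRUE` makes matching case-insensitive, and `fixed = TRUE` treats the pattern as a literal string instead of a regular expression.

```r
x <- c("Saprotroph", "saprotroph-symbiotroph", "a.b", "axb")   # hypothetical input

exact     <- grepl("Saprotroph", x)                      # TRUE FALSE FALSE FALSE
no_case   <- grepl("saprotroph", x, ignore.case = TRUE)  # TRUE TRUE FALSE FALSE
as_regex  <- grepl("a.b", x)                # "." matches any character: FALSE FALSE TRUE TRUE
as_literal <- grepl("a.b", x, fixed = TRUE) # "." taken literally:       FALSE FALSE TRUE FALSE
```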