Counting
1. Counting characters: nchar()
- length() returns 1 for an empty string, because it counts vector elements rather than characters
- nchar() returns 0 for an empty string
> x<-c("R","is","funny")
> nchar(x)
[1] 1 2 5
> length("")
[1] 1
> nchar("")
[1] 0
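To make the distinction concrete, here is a minimal sketch. The vector below extends the one above with an empty string purely for illustration; the variable names are assumptions, not part of the original notes.

```r
# Illustrative vector: three words plus one empty string
x <- c("R", "is", "funny", "")

n_elements  <- length(x)    # number of elements in the vector
n_chars     <- nchar(x)     # number of characters in each element
total_chars <- sum(n_chars) # total characters across all elements

print(n_elements)
print(n_chars)
print(total_chars)
```

length() answers "how many strings", while nchar() answers "how long is each string", which is why the empty string counts as one element with zero characters.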
2. Converting to lowercase: tolower()
> DNA <- "AtGCtttACC"
> tolower(DNA)
[1] "atgctttacc"
3. Converting to uppercase: toupper()
> DNA <- "AtGCtttACC"
> toupper(DNA)
[1] "ATGCTTTACC"
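One common use of case conversion is normalizing strings before comparing or counting them. The sketch below is only an illustration under that assumption; the sequences a and b are hypothetical.

```r
# Hypothetical sequences that differ only in case
a <- "AtGCtttACC"
b <- "ATGCTTTACC"

same_raw        <- identical(a, b)                   # compared as typed
same_normalized <- identical(toupper(a), toupper(b)) # compared after normalizing

# Count bases regardless of case: split into single characters, then tabulate
bases  <- strsplit(toupper(a), "")[[1]]
counts <- table(bases)

print(same_raw)
print(same_normalized)
print(counts)
```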
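4. Character translation: chartr(old, new, x)
chartr("A","B",x): replaces every A in the string x with B. Characters are mapped by position, so the i-th character of the first argument is replaced by the i-th character of the second.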
> DNA <- "AtGCtttACC"
> chartr("Tt","Bb",DNA)
[1] "AbGCbbbACC"
> chartr("Tt","BB",DNA)
[1] "ABGCBBBACC"
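Because chartr() maps characters position by position, it can translate several characters in one call. The sketch below uses it to complement a DNA string; this use case is an illustration added here, not something the notes above claim. It also shows that the first argument must not be longer than the second.

```r
DNA <- "AtGCtttACC"

# Each character of the first string maps to the character at the same
# position in the second string (A->T, T->A, G->C, C->G, in both cases)
complement <- chartr("ATGCatgc", "TACGtacg", DNA)
print(complement)

# Two characters to translate but only one replacement supplied:
# chartr() raises an error, captured here with try()
too_short <- try(chartr("Tt", "B", DNA), silent = TRUE)
failed <- inherits(too_short, "try-error")
print(failed)
```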
String concatenation
5. String concatenation: paste()
> paste("Var",1:5,sep="")
[1] "Var1" "Var2" "Var3" "Var4" "Var5"
> x<-list(a='aaa',b='bbb',c="ccc")
> y<-list(d="163.com",e="qq.com")
> paste(x,y,sep="@")
[1] "aaa@163.com" "bbb@qq.com" "ccc@163.com"
# Add the collapse argument to join the results into one string, using it as the separator
> paste(x,y,sep="@",collapse=';')
[1] "aaa@163.com;bbb@qq.com;ccc@163.com"
> paste(x,collapse=';')
[1] "aaa;bbb;ccc"
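The roles of sep and collapse are easy to mix up, so here is a small sketch with made-up values: sep goes between the arguments of one call, collapse goes between the elements of the resulting vector. paste0() is base R shorthand for sep = "".

```r
# sep joins the separate arguments
by_sep <- paste("a", "b", "c", sep = "-")

# collapse joins the elements of a single vector
by_collapse <- paste(c("a", "b", "c"), collapse = "-")

# sep alone has no effect on a single vector: nothing to put it between
unchanged <- paste(c("a", "b", "c"), sep = "-")

# paste0() is paste() with sep = ""
ids <- paste0("Var", 1:3)

print(by_sep)
print(by_collapse)
print(unchanged)
print(ids)
```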
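String splitting
6. String splitting: strsplit()
Syntax: strsplit(x, split, fixed = FALSE, perl = FALSE, useBytes = FALSE)
- x is the character vector to split
- split is a character vector giving where to split. By default it is matched as a regular expression (fixed=FALSE).
  Setting fixed=TRUE matches split as literal text, which is faster.
- perl=TRUE selects Perl-compatible regular expressions instead of the default extended regular expressions. For long patterns, a well-written expression combined with perl=TRUE can be faster.
- useBytes sets whether matching is done byte by byte. The default is FALSE, which matches by character rather than by byte.
- strsplit returns a list; how to process it afterwards depends on the task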
> text<-"today is a \nnice day!"
> text
[1] "today is a \nnice day!"
> strsplit(text," ")
[[1]]
[1] "today" "is" "a" "\nnice" "day!"
# \s matches any whitespace character, including the newline \n
> strsplit(text,'\\s')
[[1]]
[1] "today" "is" "a" "" "nice" "day!"
> class(strsplit(text, '\\s'))
[1] "list"
> strsplit(text,"")
[[1]]
[1] "t" "o" "d" "a" "y" " " "i" "s" " " "a" " " "\n" "n" "i" "c" "e" " " "d" "a"
[20] "y" "!"
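Two points from the argument list are worth seeing in action: split is a regular expression unless fixed=TRUE is set, and the result is always a list. The following is a minimal sketch; the versions vector is hypothetical.

```r
# Hypothetical version strings
versions <- c("4.3.1", "3.6.0")

# "." as a regular expression matches every character, leaving only empty strings
regex_split <- strsplit("a.b.c", ".")[[1]]

# fixed = TRUE treats "." as a literal dot
parts <- strsplit(versions, ".", fixed = TRUE)

# The result is a list with one character vector per input string;
# vapply() pulls the first piece out of each
major <- vapply(parts, function(p) p[1], character(1))

# "\\s+" treats a run of whitespace as one separator, avoiding empty pieces
text  <- "today is a \nnice day!"
words <- strsplit(text, "\\s+")[[1]]

print(regex_split)
print(parts)
print(major)
print(words)
```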
String searching
7. String searching: grep(), grepl()
Syntax
grep(pattern, x, ignore.case = FALSE, perl = FALSE, value = FALSE,
fixed = FALSE, useBytes = FALSE, invert = FALSE)
grepl(pattern, x, ignore.case = FALSE, perl = FALSE,
fixed = FALSE, useBytes = FALSE)
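Differences between the two
- grep returns the indices of the elements that match the regular expression
- grepl returns a logical vector, TRUE or FALSE for each element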
> grep("\\.r$",files)
[1] 3 5 8 9 10 11 12 16 18 20 22 24 25 26 29
> grepl("\\.r$",files)
[1] FALSE FALSE TRUE FALSE TRUE FALSE FALSE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE TRUE
[17] FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE TRUE TRUE FALSE FALSE TRUE FALSE FALSE FALSE
[33] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
Both give the same result when used to extract a subset of the data
> files[grep("\\.r$",files)]
[1] "agricolae.r" "cluster.r" "gam_model.r"
[4] "GBM.r" "gbm_model.r" "gbm_model1.r"
[7] "GBM1.r" "item-based CF推薦算法.r" "MASS_e107_rpart.r"
[10] "PankRange.r" "quantmod.r" "recommenderlab.r"
[13] "Rplot.r" "rules.r" "survival.r"
> files[grepl("\\.r$",files)]
[1] "agricolae.r" "cluster.r" "gam_model.r"
[4] "GBM.r" "gbm_model.r" "gbm_model1.r"
[7] "GBM1.r" "item-based CF推薦算法.r" "MASS_e107_rpart.r"
[10] "PankRange.r" "quantmod.r" "recommenderlab.r"
[13] "Rplot.r" "rules.r" "survival.r"
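The files vector in the transcript above is not defined in these notes, so the sketch below uses a small hypothetical one. It illustrates the value, ignore.case and invert arguments from the syntax, plus one case where grep and grepl subsetting differ: excluding matches when nothing matches.

```r
# Hypothetical file names, for illustration only
files <- c("cluster.r", "notes.txt", "GBM.R", "rules.r", "data.csv")

idx       <- grep("\\.r$", files)                                    # indices
r_files   <- grep("\\.r$", files, value = TRUE)                      # the strings themselves
any_case  <- grep("\\.r$", files, ignore.case = TRUE, value = TRUE)  # also matches ".R"
not_r     <- grep("\\.r$", files, invert = TRUE, value = TRUE)       # non-matching elements
n_matches <- sum(grepl("\\.r$", files))                              # TRUE counts as 1

# Excluding matches when the pattern matches nothing:
# grep() returns integer(0), and a negative empty index selects nothing
dropped_with_grep  <- files[-grep("\\.py$", files)]
# grepl() returns all FALSE, so negating it keeps every element
dropped_with_grepl <- files[!grepl("\\.py$", files)]

print(r_files)
print(any_case)
print(not_r)
print(length(dropped_with_grep))
print(length(dropped_with_grepl))
```

For excluding elements, the logical form with grepl is therefore the safer choice.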
8. String searching: regexpr(), gregexpr(), regexec()
- Return the start position of each match and the length of the matched text
- Can be used to extract substrings
> text<-c("Hello, Adam","Hi,Adam!","How are you,Adam")
> text
[1] "Hello, Adam" "Hi,Adam!" "How are you,Adam"
> regexpr("Adam",text)
[1] 8 4 13
attr(,"match.length")
[1] 4 4 4
attr(,"useBytes")
[1] TRUE
> gregexpr("Adam",text)
[[1]]
[1] 8
attr(,"match.length")
[1] 4
attr(,"useBytes")
[1] TRUE
[[2]]
[1] 4
attr(,"match.length")
[1] 4
attr(,"useBytes")
[1] TRUE
[[3]]
[1] 13
attr(,"match.length")
[1] 4
attr(,"useBytes")
[1] TRUE
> regexec("Adam",text)
[[1]]
[1] 8
attr(,"match.length")
[1] 4
[[2]]
[1] 4
attr(,"match.length")
[1] 4
[[3]]
[1] 13
attr(,"match.length")
[1] 4
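The positions and lengths shown above can be turned into the matched text itself. The sketch below does this by hand with substring(), then with base R's regmatches(), which takes the match object directly. The sentence in s and the capture-group pattern are made up for illustration.

```r
text <- c("Hello, Adam", "Hi,Adam!", "How are you,Adam")

# regexpr(): first match per element; use start and match.length by hand
m     <- regexpr("Adam", text)
start <- as.integer(m)
len   <- attr(m, "match.length")
by_position <- substring(text, start, start + len - 1)

# regmatches() does the same extraction from the match object
by_regmatches <- regmatches(text, m)

# gregexpr(): every match per element (illustrative sentence)
s <- "Adam met Adam"
g <- gregexpr("Adam", s)
all_starts  <- as.integer(g[[1]])
all_matches <- regmatches(s, g)[[1]]

# regexec(): the full match followed by each parenthesized group
groups <- regmatches("Hello, Adam",
                     regexec("([A-Za-z]+), *([A-Za-z]+)", "Hello, Adam"))[[1]]

print(by_position)
print(all_starts)
print(groups)
```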
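String replacement
9. String replacement: sub(), gsub()
- Strictly speaking, R has no function that replaces text inside an existing string in place
- R passes arguments by value, not by reference, so sub and gsub return a modified copy and leave the original unchanged
- sub replaces only the first match in each element, whereas gsub replaces every match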
> text<-c("Hello, Adam","Hi,Adam!","How are you,Ava")
> sub(pattern="Adam",replacement="word",text)
[1] "Hello, word" "Hi,word!" "How are you,Ava"
> sub(pattern="Adam|Ava",replacement="word",text)
[1] "Hello, word" "Hi,word!" "How are you,word"
> gsub(pattern="Adam|Ava",replacement="word",text)
[1] "Hello, word" "Hi,word!" "How are you,word"
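In the transcript above each string contains a single match, so sub and gsub happen to give the same output. The sketch below uses a made-up sentence with several matches to show the difference, and checks that the input is left untouched.

```r
# Illustrative sentence with three matches
text <- "Adam and Adam met Ava"

first_only  <- sub("Adam|Ava", "word", text)   # replaces the first match only
all_matches <- gsub("Adam|Ava", "word", text)  # replaces every match

# The original is unchanged; assign the result to keep the replacement
still_original <- identical(text, "Adam and Adam met Ava")

# The pattern is a regular expression unless fixed = TRUE is given
regex_dot   <- gsub(".", "-", "a.b.c")                # "." matches any character
literal_dot <- gsub(".", "-", "a.b.c", fixed = TRUE)  # "." is a literal dot

print(first_only)
print(all_matches)
print(regex_dot)
print(literal_dot)
```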
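String extraction
- substr and substring split or extract strings by position; they do not use regular expressions themselves
- Combined with the regular expression functions regexpr, gregexpr or regexec, they make it easy to extract the required information from large amounts of text
Syntax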
substr(x, start, stop)
substring(text, first, last = 1000000L)
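- In both functions the first argument is the character vector to extract from, the second is a vector of start positions, and the third is a vector of end positions
- substr returns as many strings as the length of its first argument
- substring returns as many strings as the longest of its three arguments, recycling the shorter vectors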
> x <- "123456789"
> substr(x, c(2,4), c(4,5,8))
[1] "234"
> substring(x, c(2,4), c(4,5,8))
[1] "234" "45" "2345678"
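Because x has length 1, substr returns a single string:
only the first pair from the second and third arguments is used, start position 2 and end position 4.
In the substring call the longest of the three arguments is c(4,5,8). Under the recycling rule the first argument effectively becomes c(x,x,x)
and the second becomes c(2,4,2), so the extracted ranges are 2-4, 4-5 and 2-8.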
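The recycling behaviour of substring can be put to practical use. The following sketch, whose use cases are illustrative rather than taken from the notes above, builds all prefixes of a string and cuts it into fixed-width chunks.

```r
x <- "123456789"

# One start position, three end positions: the start is recycled
prefixes <- substring(x, 1, 1:3)

# Vectors of starts and ends cut the string into chunks of width 3
chunks <- substring(x, seq(1, 7, by = 3), seq(3, 9, by = 3))

# substring() has a default for `last`, so a start position alone is enough
tail_default <- substring(x, 7)

# substr() has no default for `stop`, so the end must be given explicitly
tail_explicit <- substr(x, 7, nchar(x))

print(prefixes)
print(chunks)
print(tail_default)
print(tail_explicit)
```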