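This comes up in a lot of situations. For example, I want to pull out the gene IDs that follow previous_id in the annotation attributes. At first glance it looks like the cut command could slice them out piece by piece, but previous_id does not always come right after the ID field, and the IDs that follow derived_from need to be extracted as well. In that case regular-expression matching is the practical way to go.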
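To make the problem concrete, here is a minimal Python sketch. The two attribute strings and the `primconf` / `Name` keys are made up for illustration (the real column 9 of test.gff3 may be laid out differently); they only show that previous_id can sit at different positions, which is why picking a fixed field fails while a pattern search does not. The helper names `second_field` and `old_ids` are hypothetical.

```python
import re

# Hypothetical attribute strings (column 9 of a GFF3 line); invented for
# illustration, not copied from test.gff3.
attrs_a = "ID=TraesCS1A02G000100;previous_id=TraesCS1A01G000100;primconf=HC"
attrs_b = "ID=TraesCS1A02G000200;primconf=HC;Name=x;previous_id=TraesCS1A01G000200"

# Same pattern as in the Perl script: old-style IDs carry "01G".
OLD_ID = re.compile(r"TraesCS[1-7][A-D]01G[0-9]{6}[CHL]{0,2}")


def second_field(attrs):
    """Position-based extraction, similar in spirit to `cut -d';' -f2`."""
    return attrs.split(";")[1]


def old_ids(attrs):
    """Pattern-based extraction: finds the IDs wherever they appear."""
    return OLD_ID.findall(attrs)


print(second_field(attrs_a))  # happens to be the previous_id field
print(second_field(attrs_b))  # a different field: the position moved
print(old_ids(attrs_a))
print(old_ids(attrs_b))
```

The position-based helper only works when previous_id happens to be the second field, while the regex version returns the old ID for both strings.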
Perl
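$ cat test.pl
#! /usr/bin/perl
use warnings;
use strict;

my $file_name = $ARGV[0];
# Stop with a clear message if the file cannot be opened
open my $in_fh, "<", $file_name or die "Cannot open $file_name: $!";
while (<$in_fh>) {
    chomp $_;
    my @one_line = split("\t", $_);
    # Column 9 (index 8) holds the attributes; /g in list context returns every match
    my @old_name = ( $one_line[8] =~ /TraesCS[1-7][A-D]01G[0-9]{6}[CHL]{0,2}/g );
    print "@old_name\n";
}
close $in_fh;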
$ perl test.pl test.gff3 | head -n 5
TraesCS1A01G000100
TraesCS1A01G000200
TraesCS1A01G000300
TraesCS1A01G000400
TraesCS1A01G000500
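Pay attention to this line: my @old_name = ( $one_line[8] =~ /TraesCS[1-7][A-D]01G[0-9]{6}[CHL]{0,2}/g );
As far as I understand it, the result has to be received into an array and the g modifier has to be present, whether the pattern matches once or several times. The reason is that in list context a match with /g returns all the matched substrings, whereas without /g (and without capture groups) it only returns 1 on success.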
R
In R this is straightforward, using the stringr package.
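library(tidyverse)
a <- read.table("test.gff3", header = FALSE, sep = "\t")
b <- as.character(a$V9)
# Extract the first match in each string
c <- str_extract(b, "TraesCS[1-7][ABD]01G[0-9]{6}[CHL]{0,2}")
# Extract all matches in each string.
# simplify = TRUE returns them as a character matrix in which every row has the same number of elements; missing entries are filled with the empty string "".
d <- str_extract_all(b, "TraesCS[1-7][ABD]0[12]G[0-9]{6}[CHL]{0,2}", simplify = TRUE) # the regex is changed slightly here (0[12]G) so that several matches per line can be demonstrated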
> head(c)
[1] "TraesCS1A01G000100" "TraesCS1A01G000200" "TraesCS1A01G000300" "TraesCS1A01G000400" "TraesCS1A01G000500" "TraesCS1A01G000600"
> head(d)
     [,1]                 [,2]                 [,3]
[1,] "TraesCS1A02G000100" "TraesCS1A01G000100" "TraesCS1A02G000100"
[2,] "TraesCS1A02G000200" "TraesCS1A01G000200" "TraesCS1A02G000200"
[3,] "TraesCS1A02G000300" "TraesCS1A01G000300" "TraesCS1A02G000300"
[4,] "TraesCS1A02G000400" "TraesCS1A01G000400" "TraesCS1A02G000400"
[5,] "TraesCS1A02G000500" "TraesCS1A01G000500" "TraesCS1A02G000500"
[6,] "TraesCS1A02G000600" "TraesCS1A01G000600" "TraesCS1A02G000600"
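For readers who think more easily in Python, the padding that simplify = TRUE performs can be mimicked by hand. This is only a sketch of the idea, not how stringr implements it; the function name `extract_all_padded` and the input strings are made up for illustration.

```python
import re

PATTERN = re.compile(r"TraesCS[1-7][ABD]0[12]G[0-9]{6}[CHL]{0,2}")


def extract_all_padded(strings):
    """Return one row per input string, padded with "" to equal length."""
    rows = [PATTERN.findall(s) for s in strings]
    # The widest row decides the number of columns (0 if there is no input)
    width = max((len(r) for r in rows), default=0)
    return [r + [""] * (width - len(r)) for r in rows]


# Made-up inputs with 2, 1, and 0 matches respectively.
rows = extract_all_padded([
    "ID=TraesCS1A02G000100;previous_id=TraesCS1A01G000100",
    "ID=TraesCS1A02G000200",
    "no gene id here",
])
for row in rows:
    print(row)
```

Every row ends up with two entries, and the missing ones are empty strings, which is the same shape the character matrix has.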
Python
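$ cat test.py
import re

# Gene ID pattern; 0[12]G matches both the old (01G) and the new (02G) IDs
pattern = re.compile(r"TraesCS[1-7][A-D]0[12]G[0-9]{6}[CHL]{0,2}")

with open("./test.gff3") as gff:
    for line in gff:
        # findall returns every non-overlapping match on the line as a list
        ids = pattern.findall(line)
        print("\t".join(ids))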
$ python3 test.py
TraesCS1A02G000100 TraesCS1A01G000100 TraesCS1A01G000200 TraesCS1A02G000100
TraesCS1A02G000200 TraesCS1A01G000200 TraesCS1A02G000200
TraesCS1A02G000300 TraesCS1A01G000300 TraesCS1A02G000300
TraesCS1A02G000400 TraesCS1A01G000400 TraesCS1A02G000400
TraesCS1A02G000500 TraesCS1A01G000500 TraesCS1A02G000500
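The distinction between "first match" and "all matches" that the R section draws with str_extract and str_extract_all also exists in Python's re module. The sketch below shows one way to express it; the helper names `first_id` and `all_ids` and the sample string are hypothetical.

```python
import re

PATTERN = re.compile(r"TraesCS[1-7][A-D]0[12]G[0-9]{6}[CHL]{0,2}")


def first_id(text):
    """First match only (like str_extract); None when nothing matches."""
    m = PATTERN.search(text)
    return m.group(0) if m else None


def all_ids(text):
    """Every non-overlapping match (like str_extract_all); may be empty."""
    return PATTERN.findall(text)


# Made-up attribute string for illustration.
text = "ID=TraesCS1A02G000100;previous_id=TraesCS1A01G000100"
print(first_id(text))
print(all_ids(text))
print(first_id("no id"), all_ids("no id"))
```

Where str_extract returns NA for a string without a match, re.search returns None, so the no-match case needs an explicit check before calling group().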