提取fasta文件特定ID序列,及其計算序列長度

參考:

https://www.biostars.org/p/127141/
快速計算fasta序列長度的方法


Tips:

有時候需要通過具體的ID裸违,獲取序列信息,及其計算序列長度.
(下面主要通過linux 相關命令來實現(xiàn))

### 測試數(shù)據(jù):
$ cat >test.fa
>blah_C1
ACTGTCTGTC
ACTGTGTTGTG
ATGTTGTGTGTG
>blah_C2
ACTTTATATATT
>blah_C3
ACTTATATATATATA
>blah_C4
ACTTATATATATATA
>blah_C5
ACTTTATATATT

$ cat >IDs.txt
blah_C4
blah_C5

Part1: 提取特定ID的序列

  1. You can do that with BBTools: 里面好多java 軟件包本昏,都被shell 封裝好了.
$ ./filterbyname.sh 

Written by Brian Bushnell
Last modified September 1, 2016

Description:  Filters reads by name.

Usage:  filterbyname.sh in=<file> in2=<file2> out=<outfile> out2=<outfile2> names=<string,string,string> include=<t/f>

in2 and out2 are for paired reads and are optional.
If input is paired and there is only one output file, it will be written interleaved.
Important!  Leading > and @ symbols are NOT part of sequence names;  they are part of
the fasta, fastq, and sam specifications.  Therefore, this is correct:
names=e.coli_K12
And these are incorrect:
names=>e.coli_K12
names=@e.coli_K12

Parameters:
include=f       Set to 'true' to include the filtered names rather than excluding them.
substring=f     Allow one name to be a substring of the other, rather than a full match.
                   f: No substring matching.
                   t: Bidirectional substring matching.
                   header: Allow input read headers to be substrings of names in list.
                   name: Allow names in list to be substrings of input read headers.
prefix=f        Allow names to match read header prefixes.
case=t          (casesensitive) Match case also.
ow=t            (overwrite) Overwrites files that already exist.
app=f           (append) Append to files that already exist.
zl=4            (ziplevel) Set compression level, 1 (low) to 9 (max).
int=f           (interleaved) Determines whether INPUT file is considered interleaved.
names=          A list of strings or files.  The files can have one name per line, or
                be a standard read file (fasta, fastq, or sam).
minlen=0        Do not output reads shorter than this.
ths=f           (truncateheadersymbol) Ignore a leading @ or > symbol in the names file.
tws=f           (truncatewhitespace) Ignore leading or trailing whitespace in the names file.
truncate=f      Set both ths and tws at the same time.

Positional parameters:
These optionally allow you to output only a portion of a sequence.  Zero-based, inclusive.
Intended for a single sequence and include=t mode.
from=-1         Only print bases starting at this position.
to=-1           Only print bases up to this position.
range=          Set from and to with a single flag.


Java Parameters:
-Xmx            This will set Java's memory usage, overriding autodetection.
                -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs.
                    The max is typically 85% of physical memory.
-eoom           This flag will cause the process to exit if an out-of-memory
                exception occurs.  Requires Java 8u92+.
-da             Disable assertions.

To read from stdin, set 'in=stdin'.  The format should be specified with an extension, like 'in=stdin.fq.gz'
To write to stdout, set 'out=stdout'.  The format should be specified with an extension, like 'out=stdout.fasta'

Please contact Brian Bushnell at bbushnell@lbl.gov if you encounter any problems.

太長了供汛,直接看例子:

### 程序運行:
$ ./filterbyname.sh  in=test.fa names=IDs.txt out=out.fa include
/public/home/kcao/tools/bbmap//calcmem.sh: line 75: [: -v: unary operator expected
java -ea -Xmx13206m -cp /public/home/kcao/tools/bbmap/current/ driver.FilterReadsByName in=test.fa names=IDs.txt out=out.fa include
Executing driver.FilterReadsByName [in=test.fa, names=IDs.txt, out=out.fa, include]

Input is being processed as unpaired
Time:                           0.098 seconds.
Reads Processed:           5    0.05k reads/sec
Bases Processed:          87    0.00m bases/sec
Reads Out:          2
Bases Out:          27
### 查看結(jié)果:
[01:21:30] kcao@login:~/tools/bbmap
$ cat out.fa 
>blah_C4
ACTTATATATATATA
>blah_C5
ACTTTATATATT

2.1 awk 也可以搞定

$ awk 'NR==1{printf $0"\t";next}{printf /^>/ ? "\n"$0"\t" : $0}' test.fa | awk -F"\t" 'BEGIN{while((getline k < "IDs.txt")>0)i[k]=1}{gsub("^>","",$0); if(i[$1]){print ">"$1"\n"$2}}'
>blah_C4
ACTTATATATATATA
>blah_C5
ACTTTATATATT

2.2 不完善版本:

$ awk -v RS='>' 'match($0, "blah_C1"){print ">"$0}' test.fa 
>blah_C1
ACTGTCTGTC
ACTGTGTTGTG
ATGTTGTGTGTG

3.Bioawk 也可以

$ bioawk -cfastx 'BEGIN{while((getline k <"IDs.txt")>0)i[k]=1}{if(i[$name])print ">"$name"\n"$seq}' test.fa
>blah_C4
ACTTATATATATATA
>blah_C5
ACTTTATATATT

4.UCGS utilities

$ ./faSomeRecords main.fasta id.txt output.fa

option -exclude will output sequences not present in "main.fasta"

  1. python(不完善:每次提取一個ID)
$ cat >python_extract_ID.py
import sys
name = sys.argv[2]
flag = 0
with open(sys.argv[1]) as f:
    for line in f:
        line = line.strip()
        if line[0] == '>':
            if line.split()[0][1:] == name:
                flag = 1
                print(line)
            else :
                flag = 0
        else:
            if flag == 0:
                pass
            else:
                print(line)

$ python python_extract_ID.py test.fa "blah_C1"
>blah_C1
ACTGTCTGTC
ACTGTGTTGTG
ATGTTGTGTGTG
小結(jié):個人推薦awk 那個代碼,修改兩個參數(shù)就可以跑凛俱。

Part2 : 計算序列長度

  • awk
$ awk '/^>/&&NR>1{print "";}{ printf "%s",/^>/ ? $0" ":$0 }'  test.fa 
>blah_C1 ACTGTCTGTCACTGTGTTGTGATGTTGTGTGTG
>blah_C2 ACTTTATATATT
>blah_C3 ACTTATATATATATA
>blah_C4 ACTTATATATATATA
>blah_C5 ACTTTATATATT

awk 里面length 函數(shù).

$ awk '/^>/&&NR>1{print "";}{ printf "%s",/^>/ ? $0" ":$0 }'  test.fa | awk '{print $1"\t"length($2)}'
>blah_C1    33
>blah_C2    12
>blah_C3    15
>blah_C4    15
>blah_C5    12

歡迎評論補充~

?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請聯(lián)系作者
  • 序言:七十年代末紊馏,一起剝皮案震驚了整個濱河市,隨后出現(xiàn)的幾起案子蒲犬,更是在濱河造成了極大的恐慌朱监,老刑警劉巖,帶你破解...
    沈念sama閱讀 218,122評論 6 505
  • 序言:濱河連續(xù)發(fā)生了三起死亡事件原叮,死亡現(xiàn)場離奇詭異赫编,居然都是意外死亡,警方通過查閱死者的電腦和手機奋隶,發(fā)現(xiàn)死者居然都...
    沈念sama閱讀 93,070評論 3 395
  • 文/潘曉璐 我一進店門擂送,熙熙樓的掌柜王于貴愁眉苦臉地迎上來,“玉大人唯欣,你說我怎么就攤上這事嘹吨。” “怎么了境氢?”我有些...
    開封第一講書人閱讀 164,491評論 0 354
  • 文/不壞的土叔 我叫張陵蟀拷,是天一觀的道長。 經(jīng)常有香客問我萍聊,道長问芬,這世上最難降的妖魔是什么? 我笑而不...
    開封第一講書人閱讀 58,636評論 1 293
  • 正文 為了忘掉前任寿桨,我火速辦了婚禮此衅,結(jié)果婚禮上,老公的妹妹穿的比我還像新娘。我一直安慰自己挡鞍,他們只是感情好骑歹,可當我...
    茶點故事閱讀 67,676評論 6 392
  • 文/花漫 我一把揭開白布。 她就那樣靜靜地躺著墨微,像睡著了一般陵刹。 火紅的嫁衣襯著肌膚如雪。 梳的紋絲不亂的頭發(fā)上欢嘿,一...
    開封第一講書人閱讀 51,541評論 1 305
  • 那天,我揣著相機與錄音也糊,去河邊找鬼炼蹦。 笑死,一個胖子當著我的面吹牛狸剃,可吹牛的內(nèi)容都是我干的掐隐。 我是一名探鬼主播,決...
    沈念sama閱讀 40,292評論 3 418
  • 文/蒼蘭香墨 我猛地睜開眼钞馁,長吁一口氣:“原來是場噩夢啊……” “哼虑省!你這毒婦竟也來了?” 一聲冷哼從身側(cè)響起僧凰,我...
    開封第一講書人閱讀 39,211評論 0 276
  • 序言:老撾萬榮一對情侶失蹤探颈,失蹤者是張志新(化名)和其女友劉穎,沒想到半個月后训措,有當?shù)厝嗽跇淞掷锇l(fā)現(xiàn)了一具尸體伪节,經(jīng)...
    沈念sama閱讀 45,655評論 1 314
  • 正文 獨居荒郊野嶺守林人離奇死亡,尸身上長有42處帶血的膿包…… 初始之章·張勛 以下內(nèi)容為張勛視角 年9月15日...
    茶點故事閱讀 37,846評論 3 336
  • 正文 我和宋清朗相戀三年绩鸣,在試婚紗的時候發(fā)現(xiàn)自己被綠了怀大。 大學時的朋友給我發(fā)了我未婚夫和他白月光在一起吃飯的照片。...
    茶點故事閱讀 39,965評論 1 348
  • 序言:一個原本活蹦亂跳的男人離奇死亡呀闻,死狀恐怖化借,靈堂內(nèi)的尸體忽然破棺而出,到底是詐尸還是另有隱情捡多,我是刑警寧澤蓖康,帶...
    沈念sama閱讀 35,684評論 5 347
  • 正文 年R本政府宣布,位于F島的核電站局服,受9級特大地震影響钓瞭,放射性物質(zhì)發(fā)生泄漏。R本人自食惡果不足惜淫奔,卻給世界環(huán)境...
    茶點故事閱讀 41,295評論 3 329
  • 文/蒙蒙 一山涡、第九天 我趴在偏房一處隱蔽的房頂上張望。 院中可真熱鬧,春花似錦鸭丛、人聲如沸竞穷。這莊子的主人今日做“春日...
    開封第一講書人閱讀 31,894評論 0 22
  • 文/蒼蘭香墨 我抬頭看了看天上的太陽瘾带。三九已至,卻和暖如春熟菲,著一層夾襖步出監(jiān)牢的瞬間看政,已是汗流浹背。 一陣腳步聲響...
    開封第一講書人閱讀 33,012評論 1 269
  • 我被黑心中介騙來泰國打工抄罕, 沒想到剛下飛機就差點兒被人妖公主榨干…… 1. 我叫王不留允蚣,地道東北人。 一個月前我還...
    沈念sama閱讀 48,126評論 3 370
  • 正文 我出身青樓呆贿,卻偏偏與公主長得像嚷兔,于是被迫代替她去往敵國和親。 傳聞我的和親對象是個殘疾皇子做入,可洞房花燭夜當晚...
    茶點故事閱讀 44,914評論 2 355

推薦閱讀更多精彩內(nèi)容