前言
需求:N個fasta文件半开,里面的序列可能有大量的重復隅茎,但是其header并不一定相同裙盾,需要把他們合并并去重朱巨。
方法一
首先用cat命令將他們合并
cat *.fasta > merge.fasta
perl合并
perl rmfastadup.pl merge.fasta nodup.fasta
rmfastadup.pl 代碼如下:
#! /usr/bin/perl
# Downloaded from http://www.bioinformatics-made-simple.com
# 去除fasta文件中重復的序列
# 用法:perl rmfastadup.pl input,fasta output.fasta
use Bio::Perl;
my $infile1 = $ARGV[0];
my %FIRSTSEQ;
my $total = 0;
my $total1=0;
my $dup = 0;
open (OUTIE, ">$ARGV[1]");
$infile1 = Bio::SeqIO -> new('-format'=>'Fasta',-file => $infile1);
#read in all fasta sequences
while ((my $seqobj = $infile1->next_seq()))
{
$rawid = $seqobj -> display_id;
$seq = $seqobj -> seq;
#print "ID is $rawid\n";
#print "Sequence is $seq\n";
if(defined($FIRSTSEQ{$seq}))
{
print "Key match with $FIRSTSEQ{$seq}\n";
$dup++;
}
else
{
$FIRSTSEQ{$seq} = $rawid;
$total++;
}
}
while ( ($key, $value) = each(%FIRSTSEQ) )
{
print OUTIE ">$value\n";
print OUTIE "$key\n\n";
$total1++;
}
print "\n$total unduplicated sequences in the file\n";
print "$dup duplicated sequences in the file\n";
print "$total1 unique sequences printed out.\n";
close(OUTIE);
這個方法在fasta文件數(shù)量少、序列少的情況下比較簡單好用颖医,但是序列數(shù)多了以后就會賊慢位衩,讓人受不了,親測10G的fasta整整跑了四個多小時熔萧,吐血糖驴。
方法二
大量的序列僚祷,必須通過多線程的編程來提高效率,下面提供了一個perl代碼贮缕,是上面的多線程版本久妆。
用法
perl p-rmfastadup.pl file1.fasta file2.fasta <...>
最后會生成nodup.fasta文件。
另外跷睦,核心數(shù)可以自行修改筷弦,具體看代碼中的注釋吧
p-rmfastadup.pl的代碼如下
#! /usr/bin/perl
#同時處理X個fasta文件,最終去除這些fasta文件中序列相同的(header可以不同抑诸,一概被認為是同一序列)序列烂琴,最終輸出無重復的fasta文件:nodup.fasta
#usage: perl p-rmfastadup.pl file1.fasta file2.fasta <...>
use threads;
use threads::shared;
use Bio::Perl;
my %FIRSTSEQ : shared;
my $total : shared;
my $total1 : shared;
my $dup : shared;
my @files : shared;
@files = @ARGV;
# 創(chuàng)建線程,可以自行增減
my $t1 = threads->create(\&checkfasta);
sleep(0.1);
my $t2 = threads->create(\&checkfasta);
my $t3 = threads->create(\&checkfasta);
my $t4 = threads->create(\&checkfasta);
my $t5 = threads->create(\&checkfasta);
my $t6 = threads->create(\&checkfasta);
my $t7 = threads->create(\&checkfasta);
my $t8 = threads->create(\&checkfasta);
my $t9 = threads->create(\&checkfasta);
my $t10 = threads->create(\&checkfasta);
# 收割線程蜕乡,需要配合上面的線程
$t1->join;
$t2->join;
$t3->join;
$t4->join;
$t5->join;
$t6->join;
$t7->join;
$t8->join;
$t9->join;
$t10->join;
sub checkfasta {
while (@files) {
my $infile1 = shift @files;
print $infile1."\n";
$infile1 = Bio::SeqIO -> new('-format'=>'Fasta',-file => $infile1);
#read in all fasta sequences
while ((my $seqobj = $infile1->next_seq()))
{
$rawid = $seqobj -> display_id;
$seq = $seqobj -> seq;
#print "ID is $rawid\n";
#print "Sequence is $seq\n";
if(defined($FIRSTSEQ{$seq}))
{
print "Key match with $FIRSTSEQ{$seq}\n";
$dup++;
}
else
{
$FIRSTSEQ{$seq} = $rawid;
$total++;
}
}
}
}
#checkfasta();
open (OUTIE, ">nodup.fasta");
while ( ($key, $value) = each(%FIRSTSEQ) )
{
print OUTIE ">$value\n";
print OUTIE "$key\n\n";
$total1++;
}
print "$total1 unique sequences printed out.\n";
close(OUTIE);
2020.6.26更新
沒事干造什么輪子奸绷,已經(jīng)有人寫好啦!用seqkit完美解決需求层玲!