最近需要盡可能的下載細(xì)菌基因組挣郭,ncbi ftp站點(diǎn)(ftp://ftp.ncbi.nlm.nih.gov/genomes)里面"all"這個(gè)目錄可以下載,不過"all"目錄還有古細(xì)菌之類的哈倔矾。如果只想取Refseq或genbank數(shù)據(jù)庫其一,可以通過以下方式:
#下載Refseq/genbank數(shù)據(jù)庫中的細(xì)菌基因組數(shù)據(jù)
#截止20190106,有140681條基因組數(shù)據(jù)
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/assembly_summary.txt
ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/bacteria/assembly_summary.txt
#文件第12列是拼接狀態(tài)(Complete Genome/Scaffold/contig),第20列是ftp路徑载慈,下面是下載組裝完整的基因組信息
awk -F "\t" '$12=="Complete Genome" && $11=="latest"{print $20}' assembly_summary.txt > ftpdirpaths
awk 'BEGIN{FS=OFS="/";filesuffix="genomic.gbff.gz"}{ftpdir=$0;asm=$10;file=asm"_"filesuffix;print ftpdir,file}' ftpdirpaths > ftpfilepaths
cut -d / -f 11 ftpfilepaths | paste - ftpfilepaths | while read a b;do echo "wget -c -nd -r -np -k -L -p -nd -P genbank $b && gzip -d genbank/$a";done >run.download.sh
#腳本只下載了gbff文件,當(dāng)然珍手,如果想要gff,fna,faa通過小腳本轉(zhuǎn)就可以了,或者加上“-A faa.gz,fna.gz,gff.gz,gpff.gz”
真的超級(jí)占存儲(chǔ)空間辞做,我只下載gbff文件琳要,需要其他格式的文件再簡單格式轉(zhuǎn)化就可以了。同時(shí)秤茅,下載一批立馬處理一批稚补。
Refseq和genbank的區(qū)別
Refseq下載的是GCF開頭的文件,而genbank下載的是GCA開頭的文件框喳。
具體看這里:https://zhuanlan.zhihu.com/p/20749737
GenBank是核苷酸數(shù)據(jù)庫课幕,RefSeq是基因數(shù)據(jù)庫 ,具體哪個(gè)更全五垮?互為補(bǔ)充乍惊,以前做分析的時(shí)候發(fā)現(xiàn)有些細(xì)菌基因組在GenBank里面沒有在RefSeq有,還有可能兩個(gè)數(shù)據(jù)庫里面某個(gè)基因組只有一個(gè)fna文件沒有注釋放仗,這種情況如果遇到自己動(dòng)手用prokka十分鐘就能解決润绎。
生信小腳本收藏癖,有時(shí)遇到比較實(shí)用的腳本會(huì)存下來放在自己環(huán)境變量bin里面诞挨,積累會(huì)使以后的分析變得更加高效
附上兩個(gè)網(wǎng)上看到好用的腳本
1.GenBank轉(zhuǎn)faa格式
#!/usr/bin/env perl
# This program reads gbk files and extracts all amino acid sequences from the
# /translation fields into a FASTA file. The FASTA header contains a sequential
# number followed by the taxon id, which is extracted from the
# /db_xref="taxon:<ID>" field. Only letters in the 20 letter amino acid
# alphabet are retained in the FASTA file.
#
# author: Peter Menzel
#
# This file is part of Kaiju, Copyright 2015-2017 Peter Menzel and Anders Krogh
# Kaiju is licensed under the GPLv3, see the file LICENSE.
#
use strict;
use warnings;
use IO::Uncompress::AnyUncompress qw(anyuncompress $AnyUncompressError) ;
my $t = 0;
my $taxid;
my $protein_id;
if(!defined $ARGV[1]) { die "Usage: $0 infile.gbk outfile.faa"; }
open(OUT,">",$ARGV[1]) or die "Could not open file $ARGV[1] for writing.";
my $in_fh = new IO::Uncompress::AnyUncompress $ARGV[0] or die "Opening input file failed: $AnyUncompressError\n";
while(<$in_fh>) {
chomp;
if(m,/db_xref="taxon:(\d+)",) {
$taxid = $1;
}
elsif(m,/protein_id="([^"]+)",) {
$protein_id=$1;
}
elsif(m,\s+/translation="([^"]+)",) {
if(!defined($taxid)) { die "No taxon id found in gbk file $ARGV\n";}
print OUT ">$protein_id\_$taxid\n";
my $seq = $1;
$seq =~ tr/BZ/DE/; # a.a. alphabet specifies `B' matches `N' or `D', and `Z' matches `Q' or `E.', here we use substitution with higher score
$seq =~ s/[^ARNDCQEGHILKMFPSTWYV]//gi;
print OUT "$seq\n";
}
elsif(m,\s+/translation="([^"]+)$,) {
if(!defined($taxid)) { die "No taxon id found in gbk file $ARGV\n";}
print OUT ">$protein_id\_$taxid\n";
$t = 1;
my $seq = $1;
$seq =~ tr/BZ/DE/;
$seq =~ s/[^ARNDCQEGHILKMFPSTWYV]//gi;
print OUT "$seq\n";
}
elsif($t) {
if(m,",) {
tr/BZ/DE/;
s/[^ARNDCQEGHILKMFPSTWYV]//gi;
print OUT $_,"\n";
$t = 0;
}
else {
tr/BZ/DE/;
s/[^ARNDCQEGHILKMFPSTWYV]//gi;
print OUT $_,"\n";
}
}
}
close($in_fh);
close(OUT);
2.Genbank轉(zhuǎn)fna格式
#!/usr/bin/env python
import sys
from Bio import SeqIO
if len(sys.argv) < 3:
print('USAGE: gbk2fna GBK FNA')
sys.exit(65)
SeqIO.write(SeqIO.parse(sys.argv[1], 'genbank'), sys.argv[2], 'fasta')