Basic Local Alignment Search Tool (BLAST)
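BLAST is an algorithm for comparing the primary structure of biological sequences, such as the amino acid sequences of different proteins or the DNA sequences of different genes. Given a database containing a number of sequences, BLAST lets researchers search it for sequences identical or similar to a sequence of interest. For example, if a previously unknown gene is discovered in a non-human animal, researchers will typically run a BLAST search against the human genome to check whether humans carry a similar gene (judged by sequence similarity). The BLAST algorithm and the program that implements it were developed by Warren Gish, David J. Lipman, and Webb Miller at the US National Center for Biotechnology Information (NCBI). (from Wikipedia)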
A suite of tools
The key concepts of BLAST
-Searches may take place in nucleotide space, protein space, or translated space, where nucleotides are translated into proteins.
-Searches may implement search “strategies”: optimizations for a particular task. Different search strategies will return different alignments.
-Searches use alignments that rely on scoring matrices.
-Searches may be customized with many additional parameters. BLAST has many subtle functions that most users never need.
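To make the idea of alignment scoring concrete, here is a minimal sketch that scores an ungapped alignment of two equal-length nucleotide strings. The function name score_ungapped and the +1/-2 match/mismatch values are assumptions chosen for illustration; BLAST's real scoring (gap penalties, protein matrices such as BLOSUM62, statistics) is considerably richer.

```shell
# Toy scoring of an ungapped alignment. Illustrative only: the function
# name and the default +1 / -2 scores are assumptions, not BLAST defaults.
score_ungapped() {
    # $1, $2: aligned sequences of equal length
    # $3: score per match (default 1), $4: score per mismatch (default -2)
    if [ "${#1}" -ne "${#2}" ]; then
        echo "sequences must have equal length" >&2
        return 1
    fi
    awk -v a="$1" -v b="$2" -v m="${3:-1}" -v x="${4:--2}" 'BEGIN {
        s = 0
        # Compare the sequences position by position and sum the scores.
        for (i = 1; i <= length(a); i++)
            s += (substr(a, i, 1) == substr(b, i, 1)) ? m : x
        print s
    }'
}
```

Calling score_ungapped ACGT ACGA compares four positions, three of which match. Changing the third and fourth arguments shows how a different scoring scheme changes which alignment would rank best.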
Basic steps for using BLAST
1.Build a BLAST database with makeblastdb
2.Choose the appropriate tool: blastn, blastp, blastx, and so on
3.Run the tool and, where needed, format the output
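Step 2 above comes down to matching the query type against the database type. The sketch below encodes the usual mapping; the function name pick_blast_tool is hypothetical, and tblastn (protein query against a translated nucleotide database) is a real BLAST+ program that the list above does not name explicitly.

```shell
# Hypothetical helper: print the BLAST+ program that fits a query/database pair.
pick_blast_tool() {
    # $1 = query type, $2 = database type; each is "nucl" or "prot"
    case "$1:$2" in
        nucl:nucl) echo blastn ;;   # nucleotide query, nucleotide database
        prot:prot) echo blastp ;;   # protein query, protein database
        nucl:prot) echo blastx ;;   # translated nucleotide query, protein database
        prot:nucl) echo tblastn ;;  # protein query, translated nucleotide database
        *) echo "unknown combination: $1 vs $2" >&2; return 1 ;;
    esac
}
```

For a nucleotide query against a protein database, pick_blast_tool nucl prot prints blastx. The helper leaves out tblastx, which translates both the query and the database and so also takes a nucl:nucl pair.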
Build a blast database
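# Create a directory for the database
mkdir -p ~/refs/ebola
# Fetch the Ebola virus nucleotide sequence
efetch -db nucleotide -id KM233118 -format fasta > ~/refs/ebola/KM233118.fa
Use the makeblastdb command to build a database from the Ebola nucleotide sequence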
makeblastdb -help | more
USAGE
makeblastdb [-h] [-help] [-in input_file] [-input_type type]
-dbtype molecule_type [-title database_title] [-parse_seqids]
[-hash_index] [-mask_data mask_data_files] [-mask_id mask_algo_ids]
[-mask_desc mask_algo_descriptions] [-gi_mask]
[-gi_mask_name gi_based_mask_names] [-out database_name]
[-max_file_sz number_of_bytes] [-logfile File_Name] [-taxid TaxID]
[-taxid_map TaxIDMapFile] [-version]
DESCRIPTION
Application to create BLAST databases, version 2.7.1+
REQUIRED ARGUMENTS
-dbtype <String, `nucl', `prot'>
Molecule type of target db
OPTIONAL ARGUMENTS
-h
Print USAGE and DESCRIPTION; ignore all other parameters
-help
Print USAGE, DESCRIPTION and ARGUMENTS; ignore all other parameters
-version
Print version number; ignore other arguments
*** Input options
-in <File_In>
Input file/database name
Default = `-'
-input_type <String, `asn1_bin', `asn1_txt', `blastdb', `fasta'>
Type of the data specified in input_file
Default = `fasta'
*** Configuration options
-title <String>
Title for BLAST database
Default = input file name provided to -in argument
-parse_seqids
Option to parse seqid for FASTA input if set, for all other input types
seqids are parsed automatically
-hash_index
Create index of sequence hash values.
*** Sequence masking options
-mask_data <String>
Comma-separated list of input files containing masking data as produced by
NCBI masking applications (e.g. dustmasker, segmasker, windowmasker)
-mask_id <String>
Comma-separated list of strings to uniquely identify the masking algorithm
* Requires: mask_data
* Incompatible with: gi_mask
-mask_desc <String>
Comma-separated list of free form strings to describe the masking algorithm
details
* Requires: mask_id
-gi_mask
Create GI indexed masking data.
* Requires: parse_seqids
* Incompatible with: mask_id
-gi_mask_name <String>
Comma-separated list of masking data output files.
* Requires: mask_data, gi_mask
*** Output options
-out <String>
Name of BLAST database to be created
Default = input file name provided to -in argument
Required if multiple file(s)/database(s) are provided as input
-max_file_sz <String>
Maximum file size for BLAST database files
Default = `1GB'
-logfile <File_Out>
File to which the program log should be redirected
*** Taxonomy options
-taxid <Integer, >=0>
Taxonomy ID to assign to all sequences
* Incompatible with: taxid_map
-taxid_map <File_In>
Text file mapping sequence IDs to taxonomy IDs.
Format:<SequenceId> <TaxonomyId><newline>
* Requires: parse_seqids
* Incompatible with: taxid
# Create the Ebola nucleotide database
makeblastdb -in ~/refs/ebola/KM233118.fa -dbtype nucl -out ~/refs/ebola/KM233118
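Before rebuilding a database it can be useful to check whether one already exists under a given prefix. The sketch below relies on the file extensions makeblastdb writes (.nsq for nucleotide and .psq for protein sequence data, or a .nal/.pal alias file for multi-volume databases); the function name blast_db_exists is an assumption made for this example.

```shell
# Hypothetical helper: succeed if a BLAST database with this prefix seems to exist.
blast_db_exists() {
    # $1 = database prefix (the value given to -out), $2 = "nucl" or "prot"
    case "$2" in
        nucl) ext=n ;;
        prot) ext=p ;;
        *) echo "type must be nucl or prot" >&2; return 2 ;;
    esac
    # Look for the sequence file, or an alias file for multi-volume databases.
    [ -f "$1.${ext}sq" ] || [ -f "$1.${ext}al" ]
}

# Example use (left commented out because it needs BLAST+ installed):
# blast_db_exists ~/refs/ebola/KM233118 nucl ||
#     makeblastdb -in ~/refs/ebola/KM233118.fa -dbtype nucl -out ~/refs/ebola/KM233118
```

This only tests for the presence of files; it does not verify that the database is complete or readable.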
Create the PRJNA257197 protein sequence database
# Download a FASTA file of all protein sequences in PRJNA257197
mkdir -p index
esearch -db protein -query PRJNA257197 | efetch -format fasta > index/all-proteins.fa
# Create the protein database
makeblastdb -in index/all-proteins.fa -dbtype prot -out index/all -parse_seqids
# List the database contents, printing each entry's accession with the "%a" format
blastdbcmd -db index/all -entry 'all' -outfmt "%a" | head
Downloading BLAST databases
NCBI provides downloadable databases for many species and for nearly all known sequences.
website
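# Create a directory to hold the downloaded databases
mkdir -p ~/refs/refseq
cd ~/refs/refseq
# The BLAST package ships update_blastdb.pl for downloading NCBI's prebuilt databases
# List all available databases
update_blastdb.pl --showall | more
# Download the 16S microbial database
update_blastdb.pl 16SMicrobial --decompress
# Download the taxonomy database
update_blastdb.pl taxdb --decompress
# Add the database path to the BLASTDB environment variable; taxonomy lookups also require this (for Mac)
# Single quotes keep $BLASTDB unexpanded until the profile is sourced
echo 'export BLASTDB=$BLASTDB:~/refs/refseq/' >> ~/.bash_profile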
source ~/.bash_profile
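Appending to BLASTDB unconditionally leaves a leading colon when the variable was empty and adds a duplicate entry each time the line is run. The following sketch works around both, assuming (as the export line above does) that BLASTDB is a colon-separated list of directories; the function name append_blastdb is hypothetical.

```shell
# Hypothetical helper: print BLASTDB with one directory appended.
# It only prints the new value; it does not change the environment.
append_blastdb() {
    # $1 = directory to add; a single trailing slash is dropped
    dir=${1%/}
    case ":${BLASTDB:-}:" in
        # Already listed: print the current value unchanged.
        *":$dir:"*) printf '%s\n' "$BLASTDB" ;;
        # Not listed: append, adding a colon only if BLASTDB is non-empty.
        *) printf '%s\n' "${BLASTDB:+$BLASTDB:}$dir" ;;
    esac
}

# Example use:
# export BLASTDB=$(append_blastdb "$HOME/refs/refseq")
```

Entries already stored in BLASTDB with a trailing slash are not recognized as duplicates by this simple comparison.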
(To be continued)