Introduction
SeqKit is a tool dedicated to processing FASTA/Q sequence files. It is written in Go, is fairly feature-complete, and runs stably.
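Installation
Method 1: Download a binary (latest stable/development version)
Download page: https://bioinf.shenwei.me/seqkit/download/
Download the compressed executable for your operating system and extract it with tar -zxvf *.tar.gz or another tool.
Method 2: Install with conda (latest stable version)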
conda install -c bioconda seqkit
Method 3: Install with Homebrew (latest stable version)
brew install seqkit
Usage:
seqkit rmdup [flags]
Flags:
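-n, --by-name by full name instead of just id #compare the full sequence name (the whole header line) instead of only the ID; records with an identical name are removed
-s, --by-seq by seq #deduplicate by sequence; records with identical sequence are removed
-D, --dup-num-file string file to save number and list of duplicated seqs #file that records the count and list of duplicated sequences
-d, --dup-seqs-file string file to save duplicated seqs #file that stores the removed duplicate sequences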
-h, --help help for rmdup
-i, --ignore-case ignore case
Global Flags:
--alphabet-guess-seq-length int length of sequence prefix of the first FASTA record based on which seqkit guesses the sequence type (0 for whole seq) (default 10000)
--id-ncbi FASTA head is NCBI-style, e.g. >gi|110645304|ref|NC_002516.2| Pseud...
--id-regexp string regular expression for parsing ID (default "^(\\S+)\\s?")
--infile-list string file of input files list (one file per line), if given, they are appended to files from cli arguments
-w, --line-width int line width when outputing FASTA format (0 for no wrap) (default 60)
-o, --out-file string out file ("-" for stdout, suffix .gz for gzipped out) (default "-")
--quiet be quiet and do not show extra information
-t, --seq-type string sequence type (dna|rna|protein|unlimit|auto) (for auto, it automatically detect by the first sequence) (default "auto")
-j, --threads int number of CPUs. (default value: 1 for single-CPU PC, 2 for others. can also set with environment variable SEQKIT_THREADS) (default 2)
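Examples
1. Deduplicate by full sequence name (-n compares the whole header line; omit -n to compare only the ID). Records whose name has already been seen are removed: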
seqkit rmdup -n test.fasta -o test.rmdup.fasta
2. Deduplicate by sequence. Records whose sequence has already been seen are removed:
seqkit rmdup -s test.fasta -o test.rmdup.fasta
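To try the options above on something small, the following is a minimal sketch rather than a recommended workflow. It writes a three-record FASTA file and runs rmdup in two modes. The record names and the output file names (by_id.fasta, by_seq.fasta, removed.fasta, removed.txt) are made up for illustration; only the flags listed in the help text above are used. The seqkit calls are skipped if seqkit is not on your PATH.

```shell
# Build a tiny FASTA file for illustration (record names are hypothetical).
cat > test.fasta <<'EOF'
>seq1 first copy
ACGTACGT
>seq2 same bases as seq1 in lower case
acgtacgt
>seq1 another record reusing the ID seq1
TTTTGGGG
EOF

# Run seqkit only if it is installed, so the script is safe to paste anywhere.
if command -v seqkit >/dev/null 2>&1; then
    # No -n or -s: records are compared by ID (the header text up to the first space).
    seqkit rmdup test.fasta -o by_id.fasta

    # -s compares sequences, -i ignores case; -d saves the removed records
    # and -D saves the count and list of duplicates.
    seqkit rmdup -s -i test.fasta -o by_seq.fasta -d removed.fasta -D removed.txt
fi
```

With this input, the ID-based run should drop the third record (its ID seq1 repeats), while the sequence-based run with -i should drop seq2 (same bases as seq1 apart from case). Inspecting removed.fasta and removed.txt afterwards is a quick way to check what was discarded.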