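Introduction to FASTX-Toolkit
Background
The raw FASTQ files that come off a high-throughput sequencing run store each read as four lines: one of them holds the quality values and another holds the corresponding sequence. Processing high-throughput data starts with quality control, which includes removing adapters, filtering out low-quality reads, trimming low-quality 3' and 5' ends, and discarding reads that contain many N bases. Many quality-control programs exist for high-throughput sequencing data; this post introduces one of them, fastx_toolkit.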
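As a minimal sketch of the four-line layout described above: the two-read sample and the helper names `sample_fastq` and `count_fastq_reads` are made up for illustration, and the sketch assumes that no record is wrapped over more than four lines. Under that assumption the read count is the line count divided by four.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count the reads in an unwrapped FASTQ stream on STDIN.
count_fastq_reads() {
    awk 'END { print int(NR / 4) }'
}

# Made-up two-read sample: header, sequence, separator, quality.
sample_fastq() {
    printf '%s\n' \
        '@read1' 'ACGTN' '+' 'IIII#' \
        '@read2' 'GGCAT' '+' 'IIIII'
}

# Number of reads in the sample.
sample_fastq | count_fastq_reads

# Sequence (line 2) and quality (line 4) of each record, tab-separated.
sample_fastq | awk 'NR % 4 == 2 { seq = $0 } NR % 4 == 0 { print seq "\t" $0 }'
```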
FASTX-Toolkit
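The FASTX-Toolkit is a collection of command-line tools for preprocessing short-read FASTA/FASTQ files. Next-generation sequencing machines usually produce FASTA or FASTQ files that contain many short reads (possibly with quality information). The main processing step for such files is mapping (also called aligning) the sequences to a reference genome or another database with a specialized program. Examples of such mapping programs are Blat, SHRiMP, LastZ, MAQ and many others. However, it is sometimes more productive to preprocess the FASTA/FASTQ files before mapping, manipulating the sequences so that the mapping results improve. The FASTX-Toolkit tools perform some of these preprocessing tasks.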
Available tools
- FASTQ-to-FASTA converter: converts FASTQ files to FASTA files.
- FASTQ Information: charts quality statistics and nucleotide distribution.
- FASTQ/A Collapser: collapses identical sequences in a FASTQ/A file into a single sequence (while maintaining read counts).
- FASTQ/A Trimmer: shortens reads in a FASTQ or FASTA file (removing barcodes or noise).
- FASTQ/A Renamer: renames the sequence identifiers in a FASTQ/A file.
- FASTQ/A Clipper: removes sequencing adapters/linkers.
- FASTQ/A Reverse-Complement: produces the reverse-complement of each sequence in a FASTQ/FASTA file.
- FASTQ/A Barcode splitter: splits a FASTQ/FASTA file containing multiple samples.
- FASTA Formatter: changes the width of sequence lines in a FASTA file.
- FASTA Nucleotide Changer: converts FASTA sequences from/to RNA/DNA.
- FASTQ Quality Filter: filters sequences based on quality.
- FASTQ Quality Trimmer: trims (cuts) sequences based on quality.
- FASTQ Masker: masks nucleotides with 'N' (or other character) based on quality.
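Download
Download address: the fastx_toolkit download page. The commands below fetch and unpack the 0.0.13 Linux (amd64) binaries: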
wget http://hannonlab.cshl.edu/fastx_toolkit/fastx_toolkit_0.0.13_binaries_Linux_2.6_amd64.tar.bz2
tar xjvf fastx_toolkit_0.0.13_binaries_Linux_2.6_amd64.tar.bz2
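Usage
Notes
fastx_toolkit consists of a series of commands, each providing one small, practical function. Keep the following points in mind when using it:
- Compressed input files are not supported.
- By default, sequences that contain N bases are discarded (fastq_to_fasta and fastx_clipper keep them when -n is given).
- The visualization commands depend on gnuplot and the Perl GD module.
- By default, the quality scores in a FASTQ file are assumed to be Phred+64 encoded.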
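Because the tools read from STDIN when no -i option is given, one possible workaround for compressed input is to decompress on the fly and pipe the result in. The sketch below is only an illustration: `stream_gz` is a hypothetical helper name and `reads.fq.gz` is a placeholder file name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: write the decompressed content of a gzip file to STDOUT,
# leaving the compressed file on disk untouched.
stream_gz() {
    gzip -dc -- "$1"
}

# Usage sketch (placeholder file names):
#   stream_gz reads.fq.gz | fastq_to_fasta -n -o reads.fa
```

This avoids writing an uncompressed copy of the reads to disk.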
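If make fails while building the software with the error "fgets called with bigger size than length of destination buffer", installing a newer version fixes the problem.
If fastx_quality_stats stops with "fastx_quality_stats: Invalid quality score value (char '#' ord 35 quality value -29) on line 4", add "-Q 33" to the command line. The value -29 is 35 - 64, which shows that a Phred+33 file was decoded with the default Phred+64 offset.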
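The arithmetic behind that error message can be reproduced in the shell. This is a small sketch, and `phred_score` is a hypothetical helper name: a quality character is decoded by subtracting the encoding offset from its ASCII code.

```shell
#!/usr/bin/env bash
# Hypothetical helper: decode one quality character with a given ASCII offset.
# Usage: phred_score CHAR OFFSET
phred_score() {
    # printf '%d' with a leading quote yields the character's numeric code.
    echo $(( $(printf '%d' "'$1") - $2 ))
}

phred_score '#' 64   # Phred+64 offset: prints -29, the value in the error
phred_score '#' 33   # Phred+33 offset (-Q 33): prints 2
```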
Parameters and usage
FASTQ-to-FASTA
usage: fastq_to_fasta [-h] [-r] [-n] [-v] [-z] [-i INFILE] [-o OUTFILE]
version 0.0.6
[-h] = This helpful help screen.
[-r] = Rename sequence identifiers to numbers.
[-n] = keep sequences with unknown (N) nucleotides.
Default is to discard such sequences.
[-v] = Verbose - report number of sequences.
If [-o] is specified, report will be printed to STDOUT.
If [-o] is not specified (and output goes to STDOUT),
report will be printed to STDERR.
[-z] = Compress output with GZIP.
[-i INFILE] = FASTA/Q input file. default is STDIN.
[-o OUTFILE] = FASTA output file. default is STDOUT.
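To make the conversion and the meaning of -n concrete, here is a rough awk sketch of the same idea. It is not the toolkit's implementation; `fastq_to_fasta_sketch` is a hypothetical name, records are assumed to be exactly four lines long, and only an uppercase N is treated as an unknown base.

```shell
#!/usr/bin/env bash
# Hypothetical helper: convert unwrapped FASTQ on STDIN to FASTA on STDOUT.
# Argument 1 keeps reads containing N (like -n); 0 or no argument drops them.
fastq_to_fasta_sketch() {
    awk -v keep_n="${1:-0}" '
        NR % 4 == 1 { header = ">" substr($0, 2) }   # "@id" becomes ">id"
        NR % 4 == 2 {
            if (keep_n || $0 !~ /N/) { print header; print $0 }
        }
    '
}

# Made-up input: the second read contains an N and is dropped by default.
printf '@a\nACGT\n+\nIIII\n@b\nACNT\n+\nIIII\n' | fastq_to_fasta_sketch
```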
FASTX Statistics
usage: fastx_quality_stats [-h] [-i INFILE] [-o OUTFILE]
version 0.0.6 (C) 2008 by Assaf Gordon (gordon@cshl.edu)
[-h] = This helpful help screen.
[-i INFILE] = FASTA/Q input file. default is STDIN.
If FASTA file is given, only nucleotides
distribution is calculated (there's no quality info).
[-o OUTFILE] = TEXT output file. default is STDOUT.
The output TEXT file will have the following fields (one row per column):
column = column number (1 to 36 for a 36-cycles read solexa file)
count = number of bases found in this column.
min = Lowest quality score value found in this column.
max = Highest quality score value found in this column.
sum = Sum of quality score values for this column.
mean = Mean quality score value for this column.
Q1 = 1st quartile quality score.
med = Median quality score.
Q3 = 3rd quartile quality score.
IQR = Inter-Quartile range (Q3-Q1).
lW = 'Left-Whisker' value (for boxplotting).
rW = 'Right-Whisker' value (for boxplotting).
A_Count = Count of 'A' nucleotides found in this column.
C_Count = Count of 'C' nucleotides found in this column.
G_Count = Count of 'G' nucleotides found in this column.
T_Count = Count of 'T' nucleotides found in this column.
N_Count = Count of 'N' nucleotides found in this column.
max-count = max. number of bases (in all cycles)
FASTQ Quality Chart
Usage: /usr/local/bin/fastq_quality_boxplot_graph.sh [-i INPUT.TXT] [-t TITLE] [-p] [-o OUTPUT]
[-p] - Generate PostScript (.PS) file. Default is PNG image.
[-i INPUT.TXT] - Input file. Should be the output of "solexa_quality_statistics" program.
[-o OUTPUT] - Output file name. default is STDOUT.
[-t TITLE] - Title (usually the solexa file name) - will be plotted on the graph.
FASTA/Q Nucleotide Distribution
Usage: /usr/local/bin/fastx_nucleotide_distribution_graph.sh [-i INPUT.TXT] [-t TITLE] [-p] [-o OUTPUT]
[-p] - Generate PostScript (.PS) file. Default is PNG image.
[-i INPUT.TXT] - Input file. Should be the output of "fastx_quality_statistics" program.
[-o OUTPUT] - Output file name. default is STDOUT.
[-t TITLE] - Title - will be plotted on the graph.
FASTA/Q Clipper
usage: fastx_clipper [-h] [-a ADAPTER] [-D] [-l N] [-n] [-d N] [-c] [-C] [-o] [-v] [-z] [-i INFILE] [-o OUTFILE]
version 0.0.6
[-h] = This helpful help screen.
[-a ADAPTER] = ADAPTER string. default is CCTTAAGG (dummy adapter).
[-l N] = discard sequences shorter than N nucleotides. default is 5.
[-d N] = Keep the adapter and N bases after it.
(using '-d 0' is the same as not using '-d' at all. which is the default).
[-c] = Discard non-clipped sequences (i.e. - keep only sequences which contained the adapter).
[-C] = Discard clipped sequences (i.e. - keep only sequences which did not contained the adapter).
[-k] = Report Adapter-Only sequences.
[-n] = keep sequences with unknown (N) nucleotides. default is to discard such sequences.
[-v] = Verbose - report number of sequences.
If [-o] is specified, report will be printed to STDOUT.
If [-o] is not specified (and output goes to STDOUT),
report will be printed to STDERR.
[-z] = Compress output with GZIP.
[-D] = DEBUG output.
[-i INFILE] = FASTA/Q input file. default is STDIN.
[-o OUTFILE] = FASTA/Q output file. default is STDOUT.
FASTA/Q Renamer
usage: fastx_renamer [-n TYPE] [-h] [-z] [-v] [-i INFILE] [-o OUTFILE]
Part of FASTX Toolkit 0.0.10 by A. Gordon (gordon@cshl.edu)
[-n TYPE] = rename type:
SEQ - use the nucleotides sequence as the name.
COUNT - use simply counter as the name.
[-h] = This helpful help screen.
[-z] = Compress output with GZIP.
[-i INFILE] = FASTA/Q input file. default is STDIN.
[-o OUTFILE] = FASTA/Q output file. default is STDOUT.
FASTA/Q Trimmer
usage: fastx_trimmer [-h] [-f N] [-l N] [-z] [-v] [-i INFILE] [-o OUTFILE]
version 0.0.6
[-h] = This helpful help screen.
[-f N] = First base to keep. Default is 1 (=first base).
[-l N] = Last base to keep. Default is entire read.
[-z] = Compress output with GZIP.
[-i INFILE] = FASTA/Q input file. default is STDIN.
[-o OUTFILE] = FASTA/Q output file. default is STDOUT.
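The -f and -l options are 1-based, inclusive positions. As an illustration only (`trim_fasta_sketch` is a hypothetical helper, and the input is assumed to be FASTA with each sequence on a single line), the same selection can be written with awk's substr:

```shell
#!/usr/bin/env bash
# Hypothetical helper: keep bases FIRST..LAST of every sequence line.
# Usage: trim_fasta_sketch FIRST LAST   (same meaning as -f FIRST -l LAST)
trim_fasta_sketch() {
    awk -v first="$1" -v last="$2" '
        /^>/ { print; next }                            # headers pass through
        { print substr($0, first, last - first + 1) }   # 1-based, inclusive
    '
}

# Made-up read: bases 3 to 6 of ACGTACGTAC are GTAC.
printf '>r1\nACGTACGTAC\n' | trim_fasta_sketch 3 6
```

Reads shorter than LAST are simply left at their remaining length in this sketch.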
FASTA/Q Collapser
usage: fastx_collapser [-h] [-v] [-i INFILE] [-o OUTFILE]
version 0.0.6
[-h] = This helpful help screen.
[-v] = verbose: print short summary of input/output counts
[-i INFILE] = FASTA/Q input file. default is STDIN.
[-o OUTFILE] = FASTA/Q output file. default is STDOUT.
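Collapsing means that each distinct sequence is written once together with the number of reads that carried it. A rough equivalent with standard tools is sketched below; `collapse_fasta_sketch` is a hypothetical name, the input is assumed to be single-line FASTA, and the `>rank-count` header is a format chosen for this sketch rather than a statement about the tool's exact output.

```shell
#!/usr/bin/env bash
# Hypothetical helper: collapse identical sequences from single-line FASTA on
# STDIN, most frequent first, with headers of the form ">rank-count".
collapse_fasta_sketch() {
    grep -v '^>' |                 # drop the original headers
        sort | uniq -c |           # count identical sequences
        sort -k1,1nr -k2,2 |       # highest count first, ties by sequence
        awk '{ printf ">%d-%d\n%s\n", NR, $1, $2 }'
}

# Made-up input: ACGT occurs twice, GGCC once.
printf '>a\nACGT\n>b\nGGCC\n>c\nACGT\n' | collapse_fasta_sketch
```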
FASTQ/A Artifacts Filter
usage: fastq_artifacts_filter [-h] [-v] [-z] [-i INFILE] [-o OUTFILE]
version 0.0.6
[-h] = This helpful help screen.
[-i INFILE] = FASTA/Q input file. default is STDIN.
[-o OUTFILE] = FASTA/Q output file. default is STDOUT.
[-z] = Compress output with GZIP.
[-v] = Verbose - report number of processed reads.
If [-o] is specified, report will be printed to STDOUT.
If [-o] is not specified (and output goes to STDOUT),
report will be printed to STDERR.
FASTQ Quality Filter
usage: fastq_quality_filter [-h] [-v] [-q N] [-p N] [-z] [-i INFILE] [-o OUTFILE]
version 0.0.6
[-h] = This helpful help screen.
[-q N] = Minimum quality score to keep.
[-p N] = Minimum percent of bases that must have [-q] quality.
[-z] = Compress output with GZIP.
[-i INFILE] = FASTA/Q input file. default is STDIN.
[-o OUTFILE] = FASTA/Q output file. default is STDOUT.
[-v] = Verbose - report number of sequences.
If [-o] is specified, report will be printed to STDOUT.
If [-o] is not specified (and output goes to STDOUT),
report will be printed to STDERR.
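The interaction of -q and -p is easier to see in code. The sketch below implements the rule as the help text states it (a read is kept when at least P percent of its bases have a quality of at least Q); it is an assumption-laden illustration, not the toolkit's source. `quality_filter_sketch` is a hypothetical name, the ASCII offset is passed explicitly as a third argument, and records are assumed to be exactly four lines long.

```shell
#!/usr/bin/env bash
# Hypothetical helper: keep FASTQ records in which at least P percent of the
# bases have a quality score of at least Q.
# Usage: quality_filter_sketch Q P OFFSET   (OFFSET is 33 or 64)
quality_filter_sketch() {
    awk -v q="$1" -v p="$2" -v offset="$3" '
        BEGIN { for (i = 33; i < 127; i++) ord[sprintf("%c", i)] = i }
        { rec[NR % 4] = $0 }                 # buffer the current record
        NR % 4 == 0 {                        # quality line completes a record
            n = length($0); good = 0
            for (i = 1; i <= n; i++)
                if (ord[substr($0, i, 1)] - offset >= q) good++
            if (n > 0 && good * 100 / n >= p)
                print rec[1] "\n" rec[2] "\n" rec[3] "\n" rec[0]
        }
    '
}

# Made-up read: 4 of 5 bases (80%) reach Q20 under Phred+33, so it is kept.
printf '@r\nACGTA\n+\nIIII#\n' | quality_filter_sketch 20 80 33
```

With the same read, raising the percentage to 90 would discard it, because only 80% of its bases reach the threshold.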
FASTQ/A Reverse Complement
usage: fastx_reverse_complement [-h] [-r] [-z] [-v] [-i INFILE] [-o OUTFILE]
version 0.0.6
[-h] = This helpful help screen.
[-z] = Compress output with GZIP.
[-i INFILE] = FASTA/Q input file. default is STDIN.
[-o OUTFILE] = FASTA/Q output file. default is STDOUT.
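For bare sequence lines, the reverse complement can be sketched with two standard utilities. `revcomp_sketch` is a hypothetical helper; it expects one sequence per line with no FASTA headers, and it leaves characters other than A, C, G and T (such as N) unchanged. For FASTQ data the quality string would also have to be reversed, which this sketch does not cover.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reverse-complement each sequence line on STDIN.
revcomp_sketch() {
    rev | tr 'ACGTacgt' 'TGCAtgca'   # reverse the line, then complement bases
}

printf 'AACG\n' | revcomp_sketch   # prints CGTT
```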
FASTA Formatter
usage: fasta_formatter [-h] [-i INFILE] [-o OUTFILE] [-w N] [-t] [-e]
Part of FASTX Toolkit 0.0.7 by gordon@cshl.edu
[-h] = This helpful help screen.
[-i INFILE] = FASTA/Q input file. default is STDIN.
[-o OUTFILE] = FASTA/Q output file. default is STDOUT.
[-w N] = max. sequence line width for output FASTA file.
When ZERO (the default), sequence lines will NOT be wrapped -
all nucleotides of each sequences will appear on a single
line (good for scripting).
[-t] = Output tabulated format (instead of FASTA format).
Sequence-Identifiers will be on first column,
Nucleotides will appear on second column (as single line).
[-e] = Output empty sequences (default is to discard them).
Empty sequences are ones who have only a sequence identifier,
but not actual nucleotides.
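The unwrapped layout produced with a width of zero (every sequence on one line, empty sequences discarded) can be approximated with awk. This is a sketch of the idea only, and `unwrap_fasta_sketch` is a hypothetical name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: join wrapped FASTA sequence lines into one line per
# record and drop records that have a header but no nucleotides.
unwrap_fasta_sketch() {
    awk '
        function flush() {
            if (header != "" && seq != "") { print header; print seq }
        }
        /^>/ { flush(); header = $0; seq = ""; next }   # a new record starts
        { seq = seq $0 }                                # append sequence line
        END { flush() }
    '
}

# Made-up input: s1 is wrapped over two lines, s2 is empty, s3 is short.
printf '>s1\nACGT\nTTGA\n>s2\n>s3\nGG\n' | unwrap_fasta_sketch
```

Having each sequence on a single line is what makes line-oriented tools such as grep, cut or awk usable on FASTA data.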
Example: FASTQ Information
$ fastx_quality_stats -i BC54.fq -o bc54_stats.txt
$ fastq_quality_boxplot_graph.sh -i bc54_stats.txt -o bc54_quality.png -t "My Library"
$ fastx_nucleotide_distribution_graph.sh -i bc54_stats.txt -o bc54_nuc.png -t "My Library"
Example: FASTQ/A Manipulation
Common pre-processing work-flow:
- Converting FASTQ to FASTA
- Clipping the Adapter/Linker
- Trimming to 27nt (if you're analyzing miRNAs, for example)
- Collapsing the sequences
- Plotting the clipping results
Using the FASTX-toolkit from the command line:
$ fastq_to_fasta -v -n -i BC54.fq -o BC54.fa
Input: 100000 reads.
Output: 100000 reads.
$ fastx_clipper -v -i BC54.fa -a CTGTAGGCACCATCAATTCGTA -o BC54.clipped.fa
Clipping Adapter: CTGTAGGCACCATCAATTCGTA
Min. Length: 15
Input: 100000 reads.
Output: 92533 reads.
discarded 468 too-short reads.
discarded 6939 adapter-only reads.
discarded 60 N reads.
$ fastx_trimmer -v -f 1 -l 27 -i BC54.clipped.fa -o BC54.trimmed.fa
Trimming: base 1 to 27
Input: 92533 reads.
Output: 92533 reads.
$ fastx_collapser -v -i BC54.trimmed.fa -o BC54.collapsed.fa
Collapsd 92533 reads into 36431 unique sequences.
$ fasta_clipping_histogram.pl BC54.collapsed.fa bc54_clipping.png
These steps are usually combined into a single pipeline in a shell script:
cat BC54.fq | fastq_to_fasta -n | fastx_clipper -l 15 -a CTGTAGGCACCATCAATTCGTA | fastx_trimmer -f 1 -l 27 | fastx_collapser > bc54.final.fa