FASTX-Toolkit

Introduction to FASTX-Toolkit

Background

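Raw FASTQ files coming off a high-throughput sequencer store each read as four lines, one of which holds the quality values and another the corresponding sequence. Processing high-throughput data starts with quality control, which includes removing adapters, filtering out low-quality reads, trimming low-quality 3' and 5' ends, and removing reads that contain many N bases. Many quality-control tools exist for high-throughput sequencing data; this post introduces one of them: fastx_toolkit.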
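To make the four-line layout concrete, here is a minimal Python sketch of a FASTQ reader. It is not part of fastx_toolkit; the names `FastqRecord` and `read_fastq` are made up for illustration, and the sketch assumes the common case where each record occupies exactly four lines (no wrapped sequences).

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    name: str      # header line without the leading '@'
    sequence: str  # nucleotide line
    quality: str   # quality line, one character per base


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a stream that uses exactly four lines per read."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of input
        sequence = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


# A single made-up read: header, sequence, separator, quality.
example = io.StringIO("@read1\nACGTN\n+\nIIII#\n")
records = list(read_fastq(example))
```

The third line (starting with `+`) is only a separator, so the sketch checks it and then ignores it.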

FASTX-Toolkit

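FASTX-Toolkit is a collection of command-line tools for preprocessing short-read FASTA/FASTQ files. Next-generation sequencers usually produce FASTA or FASTQ files containing many short read sequences (possibly with quality information). The main processing step for such FASTA/FASTQ files is mapping (also called aligning) the sequences to a reference genome or another database using specialized programs. Examples of such mapping programs are Blat, SHRiMP, LastZ, MAQ and many others. However, it is sometimes more productive to preprocess the FASTA/FASTQ files before mapping the sequences to the genome, manipulating the sequences so that the mapping results are better. The FASTX-Toolkit tools perform some of these preprocessing tasks.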

Available tools

FASTQ-to-FASTA converter
Convert FASTQ files to FASTA files.

FASTQ Information
Chart Quality Statistics and Nucleotide Distribution

FASTQ/A Collapser
Collapsing identical sequences in a FASTQ/A file into a single sequence (while maintaining read counts)

FASTQ/A Trimmer
Shortening reads in a FASTQ or FASTA file (removing barcodes or noise)

FASTQ/A Renamer
Renames the sequence identifiers in a FASTQ/A file

FASTQ/A Clipper
Removing sequencing adapters / linkers

FASTQ/A Reverse-Complement
Producing the reverse complement of each sequence in a FASTQ/FASTA file

FASTQ/A Barcode splitter
Splitting a FASTQ/FASTA file containing multiple samples

FASTA Formatter
Changes the width of sequence lines in a FASTA file

FASTA Nucleotide Changer
Converts FASTA sequences from/to RNA/DNA

FASTQ Quality Filter
Filters sequences based on quality

FASTQ Quality Trimmer
Trims (cuts) sequences based on quality

FASTQ Masker
Masks nucleotides with 'N' (or another character) based on quality

Download

Download location: fastx_toolkit download link

wget http://hannonlab.cshl.edu/fastx_toolkit/fastx_toolkit_0.0.13_binaries_Linux_2.6_amd64.tar.bz2
tar xjvf fastx_toolkit_0.0.13_binaries_Linux_2.6_amd64.tar.bz2

Usage

Notes

fastx_toolkit consists of a series of commands, each providing one small, practical function. Keep the following points in mind when using it:

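Compressed input files are not supported.

By default, sequences containing N bases are not allowed and are discarded automatically (several tools accept -n to keep them).

The visualization commands depend on gnuplot and the Perl GD module.

By default, the quality scores in a FASTQ file are assumed to be phred64-encoded.

When installing the software, in particular when running make, you may hit the error "fgets called with bigger size than length of destination buffer". Installing a newer version solves the problem.

If fastx_quality_stats fails with "fastx_quality_stats: Invalid quality score value (char '#' ord 35 quality value -29) on line 4", add "-Q 33" to the arguments.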
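The -29 in that error message is simply the character code of '#' (35) minus the assumed offset of 64; with an offset of 33 the same character decodes to a valid score of 2. The Python sketch below reproduces the arithmetic. The helper names `decode_quality` and `guess_offset` are hypothetical, and the guessing rule is only a rough heuristic: characters with a code below 59 cannot occur in 64-offset data, so their presence points to phred33, while their absence proves nothing.

```python
from typing import List


def decode_quality(quality: str, offset: int) -> List[int]:
    """Convert a FASTQ quality string into numeric scores for a given offset."""
    return [ord(ch) - offset for ch in quality]


def guess_offset(quality: str) -> int:
    """Heuristic guess of the encoding offset from one quality string.

    Returns 33 when a character below code 59 is present (impossible in
    64-offset data); otherwise falls back to 64, which may be wrong for
    uniformly high-quality phred33 reads.
    """
    return 33 if min(ord(ch) for ch in quality) < 59 else 64


wrong = decode_quality("#", 64)  # what the tool computes by default
right = decode_quality("#", 33)  # what "-Q 33" makes it compute
```

In practice it is safer to check how the data were produced than to rely on a guess from a few reads.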

Parameters and usage

FASTQ-to-FASTA

usage: fastq_to_fasta [-h] [-r] [-n] [-v] [-z] [-i INFILE] [-o OUTFILE]

version 0.0.6
   [-h]         = This helpful help screen.
   [-r]         = Rename sequence identifiers to numbers.
   [-n]         = keep sequences with unknown (N) nucleotides.
          Default is to discard such sequences.
   [-v]         = Verbose - report number of sequences.
          If [-o] is specified,  report will be printed to STDOUT.
          If [-o] is not specified (and output goes to STDOUT),
          report will be printed to STDERR.
   [-z]         = Compress output with GZIP.
   [-i INFILE]  = FASTA/Q input file. default is STDIN.
   [-o OUTFILE] = FASTA output file. default is STDOUT.
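As a rough illustration of what the -n and -r options mean, the following Python sketch converts in-memory FASTQ records to FASTA lines. It is not the tool's implementation: the function name and its `keep_n` / `rename` parameters are invented here, and numbering only the reads that are kept (starting from 1) is an assumption about -r.

```python
from typing import Iterable, List, Tuple


def fastq_to_fasta(
    records: Iterable[Tuple[str, str, str]],
    keep_n: bool = False,   # mirrors -n: keep reads containing N
    rename: bool = False,   # mirrors -r: replace identifiers with numbers
) -> List[str]:
    """Return FASTA lines for (name, sequence, quality) records."""
    lines: List[str] = []
    kept = 0
    for name, sequence, _quality in records:
        if not keep_n and "N" in sequence.upper():
            continue  # default behaviour: discard reads with unknown bases
        kept += 1
        label = str(kept) if rename else name  # assumption: count kept reads
        lines.append(f">{label}")
        lines.append(sequence)
    return lines


reads = [("r1", "ACGT", "IIII"), ("r2", "ACNT", "IIII")]
default_output = fastq_to_fasta(reads)
keep_and_rename = fastq_to_fasta(reads, keep_n=True, rename=True)
```

The quality string is dropped on purpose, since FASTA has no place for it.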

FASTX Statistics

usage: fastx_quality_stats [-h] [-i INFILE] [-o OUTFILE]

version 0.0.6 (C) 2008 by Assaf Gordon (gordon@cshl.edu)
   [-h] = This helpful help screen.
   [-i INFILE]  = FASTA/Q input file. default is STDIN.
                  If FASTA file is given, only nucleotides
          distribution is calculated (there's no quality info).
   [-o OUTFILE] = TEXT output file. default is STDOUT.

The output TEXT file will have the following fields (one row per column):
    column  = column number (1 to 36 for a 36-cycles read solexa file)
    count   = number of bases found in this column.
    min     = Lowest quality score value found in this column.
    max     = Highest quality score value found in this column.
    sum     = Sum of quality score values for this column.
    mean    = Mean quality score value for this column.
    Q1  = 1st quartile quality score.
    med = Median quality score.
    Q3  = 3rd quartile quality score.
    IQR = Inter-Quartile range (Q3-Q1).
    lW  = 'Left-Whisker' value (for boxplotting).
    rW  = 'Right-Whisker' value (for boxplotting).
    A_Count = Count of 'A' nucleotides found in this column.
    C_Count = Count of 'C' nucleotides found in this column.
    G_Count = Count of 'G' nucleotides found in this column.
    T_Count = Count of 'T' nucleotides found in this column.
    N_Count = Count of 'N' nucleotides found in this column.
    max-count = max. number of bases (in all cycles)
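The per-column fields listed above can be approximated in a few lines of Python. This is only a sketch under stated assumptions: `column_stats` is a made-up name, reads are passed as (sequence, quality) pairs, the offset defaults to 33, and the quartile and whisker fields are left out for brevity.

```python
from collections import Counter
from typing import List, Tuple


def column_stats(reads: List[Tuple[str, str]], offset: int = 33) -> List[dict]:
    """Compute a subset of the per-column fields for (sequence, quality) pairs."""
    n_columns = max(len(seq) for seq, _ in reads)
    rows = []
    for col in range(n_columns):
        # Only reads long enough to reach this column contribute to it.
        covering = [(seq, qual) for seq, qual in reads if col < len(seq)]
        scores = [ord(qual[col]) - offset for _, qual in covering]
        bases = Counter(seq[col] for seq, _ in covering)
        rows.append({
            "column": col + 1,  # columns are numbered from 1
            "count": len(scores),
            "min": min(scores),
            "max": max(scores),
            "sum": sum(scores),
            "mean": sum(scores) / len(scores),
            "A_Count": bases["A"],
            "C_Count": bases["C"],
            "G_Count": bases["G"],
            "T_Count": bases["T"],
            "N_Count": bases["N"],
        })
    return rows


stats = column_stats([("ACGT", "IIII"), ("AC", "##")])
```

With offset 33, 'I' decodes to 40 and '#' to 2, so the first column of this toy input has a mean of 21.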

FASTQ Quality Chart

Usage: /usr/local/bin/fastq_quality_boxplot_graph.sh [-i INPUT.TXT] [-t TITLE] [-p] [-o OUTPUT]

  [-p]           - Generate PostScript (.PS) file. Default is PNG image.
  [-i INPUT.TXT] - Input file. Should be the output of "solexa_quality_statistics" program.
  [-o OUTPUT]    - Output file name. default is STDOUT.
  [-t TITLE]     - Title (usually the solexa file name) - will be plotted on the graph.

FASTA/Q Nucleotide Distribution

Usage: /usr/local/bin/fastx_nucleotide_distribution_graph.sh [-i INPUT.TXT] [-t TITLE] [-p] [-o OUTPUT]

  [-p]           - Generate PostScript (.PS) file. Default is PNG image.
  [-i INPUT.TXT] - Input file. Should be the output of "fastx_quality_statistics" program.
  [-o OUTPUT]    - Output file name. default is STDOUT.
  [-t TITLE]     - Title - will be plotted on the graph.

FASTA/Q Clipper

usage: fastx_clipper [-h] [-a ADAPTER] [-D] [-l N] [-n] [-d N] [-c] [-C] [-o] [-v] [-z] [-i INFILE] [-o OUTFILE]

version 0.0.6
   [-h]         = This helpful help screen.
   [-a ADAPTER] = ADAPTER string. default is CCTTAAGG (dummy adapter).
   [-l N]       = discard sequences shorter than N nucleotides. default is 5.
   [-d N]       = Keep the adapter and N bases after it.
          (using '-d 0' is the same as not using '-d' at all. which is the default).
   [-c]         = Discard non-clipped sequences (i.e. - keep only sequences which contained the adapter).
   [-C]         = Discard clipped sequences (i.e. - keep only sequences which did not contained the adapter).
   [-k]         = Report Adapter-Only sequences.
   [-n]         = keep sequences with unknown (N) nucleotides. default is to discard such sequences.
   [-v]         = Verbose - report number of sequences.
          If [-o] is specified,  report will be printed to STDOUT.
          If [-o] is not specified (and output goes to STDOUT),
          report will be printed to STDERR.
   [-z]         = Compress output with GZIP.
   [-D]     = DEBUG output.
   [-i INFILE]  = FASTA/Q input file. default is STDIN.
   [-o OUTFILE] = FASTA/Q output file. default is STDOUT.
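The idea behind adapter clipping can be sketched in Python as follows. This is a deliberately simplified stand-in, not how fastx_clipper works internally: it only looks for an exact copy of the adapter, or an exact adapter prefix at the 3' end of the read, and the `min_overlap` parameter is hypothetical. Whether the real tool checks for N before or after clipping is not stated in the help text; the sketch checks after.

```python
from typing import Optional


def clip_adapter(
    sequence: str,
    adapter: str,
    min_length: int = 5,    # mirrors -l
    keep_n: bool = False,   # mirrors -n
    min_overlap: int = 3,   # hypothetical: shortest partial adapter to clip
) -> Optional[str]:
    """Return the clipped read, or None when the read would be discarded."""
    cut = sequence.find(adapter)
    if cut == -1:
        # No full adapter: look for a partial adapter hanging off the 3' end.
        longest = min(len(adapter) - 1, len(sequence))
        for size in range(longest, min_overlap - 1, -1):
            if sequence.endswith(adapter[:size]):
                cut = len(sequence) - size
                break
    clipped = sequence if cut == -1 else sequence[:cut]
    if len(clipped) < min_length:
        return None  # too short (includes adapter-only reads)
    if not keep_n and "N" in clipped:
        return None  # contains unknown bases
    return clipped


adapter = "CTGTAGG"
full_hit = clip_adapter("ACGTACGTACCTGTAGGCA", adapter)
partial_hit = clip_adapter("ACGTACGTACCTGT", adapter)
too_short = clip_adapter("ACG" + adapter, adapter)
```

Real reads contain sequencing errors, so an exact-match approach like this one would miss many adapters that a tolerant matcher finds.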

FASTA/Q Renamer

usage: fastx_renamer [-n TYPE] [-h] [-z] [-v] [-i INFILE] [-o OUTFILE]
Part of FASTX Toolkit 0.0.10 by A. Gordon (gordon@cshl.edu)

   [-n TYPE]    = rename type:
          SEQ - use the nucleotides sequence as the name.
          COUNT - use simply counter as the name.
   [-h]         = This helpful help screen.
   [-z]         = Compress output with GZIP.
   [-i INFILE]  = FASTA/Q input file. default is STDIN.
   [-o OUTFILE] = FASTA/Q output file. default is STDOUT.

FASTA/Q Trimmer

usage: fastx_trimmer [-h] [-f N] [-l N] [-z] [-v] [-i INFILE] [-o OUTFILE]

version 0.0.6
   [-h]         = This helpful help screen.
   [-f N]       = First base to keep. Default is 1 (=first base).
   [-l N]       = Last base to keep. Default is entire read.
   [-z]         = Compress output with GZIP.
   [-i INFILE]  = FASTA/Q input file. default is STDIN.
   [-o OUTFILE] = FASTA/Q output file. default is STDOUT.
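The -f and -l positions are 1-based and inclusive, which maps onto a Python slice as shown in this small sketch (the function name `trim` is illustrative only). For FASTQ input the same slice would be applied to the quality string so that both stay the same length.

```python
from typing import Optional


def trim(sequence: str, first: int = 1, last: Optional[int] = None) -> str:
    """Keep bases first..last (1-based, inclusive); last=None keeps the rest."""
    if first < 1:
        raise ValueError("first must be >= 1")
    # Position `first` is index first-1; an inclusive `last` is an exclusive
    # slice end of the same value.
    return sequence[first - 1:last]


first_four = trim("ACGTACGT", 1, 4)
from_third = trim("ACGTACGT", 3)
short_read = trim("ACG", 1, 27)
```

Reads shorter than the requested last base are returned unchanged by this sketch, which matches the example later in this post where trimming to 27 nt leaves the read count unchanged.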

FASTA/Q Collapser

usage: fastx_collapser [-h] [-v] [-i INFILE] [-o OUTFILE]

version 0.0.6
   [-h]         = This helpful help screen.
   [-v]         = verbose: print short summary of input/output counts
   [-i INFILE]  = FASTA/Q input file. default is STDIN.
   [-o OUTFILE] = FASTA/Q output file. default is STDOUT.
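Collapsing amounts to counting identical sequences and emitting each one once. The Python sketch below shows the idea with `collections.Counter`. The `collapse` function is illustrative, and the "rank-count" identifier format it produces is an assumption about the output naming rather than something stated in the help text above.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def collapse(sequences: Iterable[str]) -> List[Tuple[str, str]]:
    """Return (identifier, sequence) pairs, most abundant sequence first."""
    counts = Counter(sequences)
    # Sort by descending count; break ties alphabetically for stable output.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [
        (f"{rank}-{count}", sequence)  # assumed "rank-count" identifier
        for rank, (sequence, count) in enumerate(ranked, start=1)
    ]


collapsed = collapse(["ACGT", "TTTT", "ACGT", "GGGG", "ACGT", "TTTT"])
```

Keeping the count in the identifier is what lets downstream steps recover abundance after duplicates are gone.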

FASTQ/A Artifacts Filter

usage: fastq_artifacts_filter [-h] [-v] [-z] [-i INFILE] [-o OUTFILE]

version 0.0.6
   [-h]         = This helpful help screen.
   [-i INFILE]  = FASTA/Q input file. default is STDIN.
   [-o OUTFILE] = FASTA/Q output file. default is STDOUT.
   [-z]         = Compress output with GZIP.
   [-v]         = Verbose - report number of processed reads.
          If [-o] is specified,  report will be printed to STDOUT.
          If [-o] is not specified (and output goes to STDOUT),
          report will be printed to STDERR.

FASTQ Quality Filter

usage: fastq_quality_filter [-h] [-v] [-q N] [-p N] [-z] [-i INFILE] [-o OUTFILE]

version 0.0.6
   [-h]         = This helpful help screen.
   [-q N]       = Minimum quality score to keep.
   [-p N]       = Minimum percent of bases that must have [-q] quality.
   [-z]         = Compress output with GZIP.
   [-i INFILE]  = FASTA/Q input file. default is STDIN.
   [-o OUTFILE] = FASTA/Q output file. default is STDOUT.
   [-v]         = Verbose - report number of sequences.
          If [-o] is specified,  report will be printed to STDOUT.
          If [-o] is not specified (and output goes to STDOUT),
          report will be printed to STDERR.
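The -q and -p options combine into a single rule: a read is kept when at least p percent of its bases have a quality of q or higher. A minimal Python sketch of that rule, assuming an offset of 33 and using an invented function name:

```python
def passes_quality_filter(
    quality: str,
    min_quality: int,   # mirrors -q
    min_percent: int,   # mirrors -p
    offset: int = 33,   # assumption: phred33-encoded input
) -> bool:
    """True when enough bases reach the minimum quality score."""
    scores = [ord(ch) - offset for ch in quality]
    good = sum(1 for score in scores if score >= min_quality)
    # Integer comparison avoids floating-point rounding at the boundary.
    return good * 100 >= min_percent * len(scores)


# 'I' decodes to 40 and '#' to 2, so four of the five bases pass q=20.
kept_at_80 = passes_quality_filter("IIII#", min_quality=20, min_percent=80)
kept_at_90 = passes_quality_filter("IIII#", min_quality=20, min_percent=90)
```

Four good bases out of five is exactly 80 percent, so the read passes at -p 80 and fails at -p 90.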

FASTQ/A Reverse Complement

usage: fastx_reverse_complement [-h] [-r] [-z] [-v] [-i INFILE] [-o OUTFILE]

version 0.0.6
   [-h]         = This helpful help screen.
   [-z]         = Compress output with GZIP.
   [-i INFILE]  = FASTA/Q input file. default is STDIN.
   [-o OUTFILE] = FASTA/Q output file. default is STDOUT.
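For reference, reverse-complementing a read means complementing every base and reversing the order; for FASTQ input the quality string has to be reversed as well so that each score stays attached to its base. A short Python sketch (the function name and the handling of lowercase letters are illustrative choices):

```python
from typing import Optional, Tuple

# Translation table for DNA bases; N stays N.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(
    sequence: str, quality: Optional[str] = None
) -> Tuple[str, Optional[str]]:
    """Return the reverse complement and, if given, the reversed quality."""
    flipped = sequence.translate(_COMPLEMENT)[::-1]
    flipped_quality = quality[::-1] if quality is not None else None
    return flipped, flipped_quality


rc_sequence, rc_quality = reverse_complement("AACGTN", "ABCDEF")
```

Applying the function twice returns the original sequence and quality string.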

FASTA Formatter

usage: fasta_formatter [-h] [-i INFILE] [-o OUTFILE] [-w N] [-t] [-e]
Part of FASTX Toolkit 0.0.7 by gordon@cshl.edu

   [-h]         = This helpful help screen.
   [-i INFILE]  = FASTA/Q input file. default is STDIN.
   [-o OUTFILE] = FASTA/Q output file. default is STDOUT.
   [-w N]       = max. sequence line width for output FASTA file.
          When ZERO (the default), sequence lines will NOT be wrapped -
          all nucleotides of each sequences will appear on a single 
          line (good for scripting).
   [-t]         = Output tabulated format (instead of FASTA format).
          Sequence-Identifiers will be on first column,
          Nucleotides will appear on second column (as single line).
   [-e]         = Output empty sequences (default is to discard them).
          Empty sequences are ones who have only a sequence identifier,
          but not actual nucleotides.
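The -w and -t options describe two output layouts, which the following Python sketch imitates for a single record. The function `format_fasta` and its parameters are made up for illustration; only the behaviours described in the help text above (no wrapping at width zero, identifier and nucleotides in two columns for tabular output) are modelled, and a tab is assumed as the column separator.

```python
from typing import List


def format_fasta(
    name: str,
    sequence: str,
    width: int = 0,         # mirrors -w; 0 means no wrapping
    tabular: bool = False,  # mirrors -t
) -> List[str]:
    """Return the output lines for one FASTA record."""
    if tabular:
        # Identifier in the first column, nucleotides in the second.
        return [f"{name}\t{sequence}"]
    if width <= 0:
        return [f">{name}", sequence]
    chunks = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return [f">{name}"] + chunks


wrapped = format_fasta("r1", "ACGTACGTAC", width=4)
single_line = format_fasta("r1", "ACGTACGTAC")
table_row = format_fasta("r1", "ACGTACGTAC", tabular=True)
```

The single-line layout is the convenient one for scripting, because every record then occupies exactly two lines.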

Example: FASTQ Information

$ fastx_quality_stats -i BC54.fq -o bc54_stats.txt
$ fastq_quality_boxplot_graph.sh -i bc54_stats.txt -o bc54_quality.png -t "My Library"
$ fastx_nucleotide_distribution_graph.sh -i bc54_stats.txt -o bc54_nuc.png -t "My Library"

bc54_quality.png

bc54_nuc.png

Example: FASTQ/A Manipulation

Common pre-processing work-flow:

Converting FASTQ to FASTA
Clipping the Adapter/Linker
Trimming to 27nt (if you're analyzing miRNAs, for example)
Collapsing the sequences
Plotting the clipping results

Using the FASTX-toolkit from the command line:

$ fastq_to_fasta -v -n -i BC54.fq -o BC54.fa
Input: 100000 reads.
Output: 100000 reads.

$ fastx_clipper -v -i BC54.fa -a CTGTAGGCACCATCAATTCGTA -o BC54.clipped.fa
Clipping Adapter: CTGTAGGCACCATCAATTCGTA
Min. Length: 15
Input: 100000 reads.
Output: 92533 reads.
discarded 468 too-short reads.
discarded 6939 adapter-only reads.
discarded 60 N reads.

$ fastx_trimmer -v -f 1 -l 27 -i BC54.clipped.fa -o BC54.trimmed.fa
Trimming: base 1 to 27
Input: 92533 reads.
Output: 92533 reads.

$ fastx_collapser -v -i BC54.trimmed.fa -o BC54.collapsed.fa
Collapsd 92533 reads into 36431 unique sequences.

$ fasta_clipping_histogram.pl BC54.collapsed.fa bc54_clipping.png

These steps are usually combined in a single shell script:

cat BC54.fq | fastq_to_fasta -n | fastx_clipper -l 15 -a CTGTAGGCACCATCAATTCGTA | fastx_trimmer -f 1 -l 27 | fastx_collapser > bc54.final.fa

bc54_clipping.png

References

fastx_toolkit/commandline

https://www.cnblogs.com/zkkaka/p/6146293.html

