FASTX-Toolkit

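Introduction to FASTX-Toolkit

Background

Raw FASTQ files coming off a high-throughput sequencer store each read as a 4-line record: a header line starting with "@", the sequence, a "+" separator line, and the quality values for that sequence. Processing high-throughput data starts with quality control, which includes removing adapters, filtering out low-quality reads, trimming low-quality 3' and 5' ends, and removing reads that contain many N bases. Many quality-control tools exist for high-throughput sequencing data; this post introduces one of them: fastx_toolkit.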
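To make the 4-line record layout concrete, here is a minimal Python sketch of a FASTQ reader. It is an illustration only, not part of fastx_toolkit; the names `FastqRecord` and `read_fastq` are hypothetical, and the sketch assumes every record occupies exactly four lines (no wrapped sequences).

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    name: str      # header line without the leading '@'
    sequence: str  # nucleotide string
    quality: str   # one quality character per base


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a stream that uses exactly 4 lines per record."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of input
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(sequence) != len(quality):
            raise ValueError(f"malformed FASTQ record: {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


# A single made-up record for demonstration.
records = list(read_fastq(io.StringIO("@read1\nACGTN\n+\nIIII#\n")))
print(records[0].sequence)
```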

FASTX-Toolkit

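FASTX-Toolkit is a collection of command-line tools for preprocessing short-read FASTA/FASTQ files. Next-generation sequencers usually produce FASTA or FASTQ files containing many short reads (possibly with quality information). The main processing step for such files is mapping (also called aligning) the sequences to a reference genome or another database with a specialized program. Examples of such mapping programs are Blat, SHRiMP, LastZ, MAQ and many others. However, it is sometimes more productive to preprocess the FASTA/FASTQ files before mapping, manipulating the sequences so that the mapping results are better. The FASTX-Toolkit tools perform some of these preprocessing tasks.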

Available tools

  • FASTQ-to-FASTA converter
    Converts FASTQ files to FASTA files.
  • FASTQ Information
    Charts quality statistics and nucleotide distribution.
  • FASTQ/A Collapser
    Collapses identical sequences in a FASTQ/A file into a single sequence (while maintaining read counts).
  • FASTQ/A Trimmer
    Shortens reads in a FASTQ or FASTA file (removing barcodes or noise).
  • FASTQ/A Renamer
    Renames the sequence identifiers in a FASTQ/A file.
  • FASTQ/A Clipper
    Removes sequencing adapters / linkers.
  • FASTQ/A Reverse-Complement
    Produces the reverse complement of each sequence in a FASTQ/FASTA file.
  • FASTQ/A Barcode splitter
    Splits a FASTQ/FASTA file containing multiple samples.
  • FASTA Formatter
    Changes the width of sequence lines in a FASTA file.
  • FASTA Nucleotide Changer
    Converts FASTA sequences from/to RNA/DNA.
  • FASTQ Quality Filter
    Filters sequences based on quality.
  • FASTQ Quality Trimmer
    Trims (cuts) sequences based on quality.
  • FASTQ Masker
    Masks nucleotides with 'N' (or another character) based on quality.

Download

Download page: fastx_toolkit download link

wget http://hannonlab.cshl.edu/fastx_toolkit/fastx_toolkit_0.0.13_binaries_Linux_2.6_amd64.tar.bz2
tar xjvf fastx_toolkit_0.0.13_binaries_Linux_2.6_amd64.tar.bz2

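Usage

Notes

fastx_toolkit consists of a series of commands, each providing one small, practical function. Keep the following points in mind when using it:

  • Compressed input files are not supported.
  • Sequences containing N bases are discarded by default (fastq_to_fasta and fastx_clipper accept -n to keep them).
  • The plotting commands depend on gnuplot and the Perl GD module.
  • By default, quality scores in FASTQ files are assumed to be Phred+64 encoded.

If make fails during installation with the error "fgets called with bigger size than length of destination buffer", installing a newer version resolves the problem.

If fastx_quality_stats fails with "fastx_quality_stats: Invalid quality score value (char '#' ord 35 quality value -29) on line 4", add "-Q 33" to the parameters. The character '#' has ASCII code 35, so under the default Phred+64 assumption it decodes to 35 - 64 = -29, which is not a valid score; the file is Phred+33 encoded, and -Q 33 tells the tool to use that offset.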
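The arithmetic behind that error message can be checked with a few lines of Python. This is only a sketch of Phred decoding; the helper name `phred_scores` is hypothetical and is not a fastx_toolkit function.

```python
from typing import List


def phred_scores(quality: str, offset: int) -> List[int]:
    """Decode a FASTQ quality string using the given ASCII offset (33 or 64)."""
    return [ord(ch) - offset for ch in quality]


# '#' is ASCII 35: invalid under Phred+64, a score of 2 under Phred+33.
print(phred_scores("#", 64))  # [-29]
print(phred_scores("#", 33))  # [2]
```

A negative score under offset 64 is a strong hint that the file uses the Phred+33 encoding.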

Parameters and usage

FASTQ-to-FASTA

usage: fastq_to_fasta [-h] [-r] [-n] [-v] [-z] [-i INFILE] [-o OUTFILE]

version 0.0.6
   [-h]         = This helpful help screen.
   [-r]         = Rename sequence identifiers to numbers.
   [-n]         = keep sequences with unknown (N) nucleotides.
          Default is to discard such sequences.
   [-v]         = Verbose - report number of sequences.
          If [-o] is specified,  report will be printed to STDOUT.
          If [-o] is not specified (and output goes to STDOUT),
          report will be printed to STDERR.
   [-z]         = Compress output with GZIP.
   [-i INFILE]  = FASTA/Q input file. default is STDIN.
   [-o OUTFILE] = FASTA output file. default is STDOUT.
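As a rough illustration of what the -n and -r options mean, the following Python sketch converts in-memory FASTQ records to FASTA text. It is not the tool's implementation: the function name `fastq_to_fasta` and the `(name, sequence, quality)` tuple layout are assumptions, and numbering only the reads that are kept when renaming is a guess about the behavior of -r.

```python
from typing import Iterable, Iterator, Tuple

Record = Tuple[str, str, str]  # (name, sequence, quality), hypothetical layout


def fastq_to_fasta(records: Iterable[Record], keep_n: bool = False,
                   rename: bool = False) -> Iterator[str]:
    """Yield FASTA entries; drop reads containing N unless keep_n is set."""
    counter = 0
    for name, sequence, _quality in records:
        if not keep_n and "N" in sequence.upper():
            continue  # mirrors the default of discarding N-containing reads
        counter += 1
        title = str(counter) if rename else name  # assumption: count kept reads
        yield f">{title}\n{sequence}"


reads = [("r1", "ACGT", "IIII"), ("r2", "ACNT", "IIII")]
print("\n".join(fastq_to_fasta(reads)))
```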

FASTX Statistics

usage: fastx_quality_stats [-h] [-i INFILE] [-o OUTFILE]

version 0.0.6 (C) 2008 by Assaf Gordon (gordon@cshl.edu)
   [-h] = This helpful help screen.
   [-i INFILE]  = FASTA/Q input file. default is STDIN.
                  If FASTA file is given, only nucleotides
          distribution is calculated (there's no quality info).
   [-o OUTFILE] = TEXT output file. default is STDOUT.

The output TEXT file will have the following fields (one row per column):
    column  = column number (1 to 36 for a 36-cycles read solexa file)
    count   = number of bases found in this column.
    min     = Lowest quality score value found in this column.
    max     = Highest quality score value found in this column.
    sum     = Sum of quality score values for this column.
    mean    = Mean quality score value for this column.
    Q1  = 1st quartile quality score.
    med = Median quality score.
    Q3  = 3rd quartile quality score.
    IQR = Inter-Quartile range (Q3-Q1).
    lW  = 'Left-Whisker' value (for boxplotting).
    rW  = 'Right-Whisker' value (for boxplotting).
    A_Count = Count of 'A' nucleotides found in this column.
    C_Count = Count of 'C' nucleotides found in this column.
    G_Count = Count of 'G' nucleotides found in this column.
    T_Count = Count of 'T' nucleotides found in this column.
    N_Count = Count of 'N' nucleotides found in this column.
    max-count = max. number of bases (in all cycles)
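A simplified Python sketch of how a few of these per-column fields could be computed is shown below. It covers only column, count, min, max, sum, mean and med; the function name `column_stats` is hypothetical, the offset is passed explicitly, and the median here is Python's `statistics.median`, which may not match the tool's exact quartile method.

```python
from statistics import median
from typing import Dict, List


def column_stats(qualities: List[str], offset: int) -> List[Dict[str, float]]:
    """Per-cycle quality summary over a list of quality strings."""
    longest = max(len(q) for q in qualities)
    rows = []
    for col in range(longest):
        # Reads shorter than this cycle simply do not contribute.
        scores = [ord(q[col]) - offset for q in qualities if col < len(q)]
        rows.append({
            "column": col + 1,
            "count": len(scores),
            "min": min(scores),
            "max": max(scores),
            "sum": sum(scores),
            "mean": sum(scores) / len(scores),
            "med": median(scores),
        })
    return rows


stats = column_stats(["II#", "5I"], offset=33)
for row in stats:
    print(row)
```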

FASTQ Quality Chart

Usage: /usr/local/bin/fastq_quality_boxplot_graph.sh [-i INPUT.TXT] [-t TITLE] [-p] [-o OUTPUT]

  [-p]           - Generate PostScript (.PS) file. Default is PNG image.
  [-i INPUT.TXT] - Input file. Should be the output of "solexa_quality_statistics" program.
  [-o OUTPUT]    - Output file name. default is STDOUT.
  [-t TITLE]     - Title (usually the solexa file name) - will be plotted on the graph.

FASTA/Q Nucleotide Distribution

Usage: /usr/local/bin/fastx_nucleotide_distribution_graph.sh [-i INPUT.TXT] [-t TITLE] [-p] [-o OUTPUT]

  [-p]           - Generate PostScript (.PS) file. Default is PNG image.
  [-i INPUT.TXT] - Input file. Should be the output of "fastx_quality_statistics" program.
  [-o OUTPUT]    - Output file name. default is STDOUT.
  [-t TITLE]     - Title - will be plotted on the graph.

FASTA/Q Clipper

usage: fastx_clipper [-h] [-a ADAPTER] [-D] [-l N] [-n] [-d N] [-c] [-C] [-o] [-v] [-z] [-i INFILE] [-o OUTFILE]

version 0.0.6
   [-h]         = This helpful help screen.
   [-a ADAPTER] = ADAPTER string. default is CCTTAAGG (dummy adapter).
   [-l N]       = discard sequences shorter than N nucleotides. default is 5.
   [-d N]       = Keep the adapter and N bases after it.
          (using '-d 0' is the same as not using '-d' at all. which is the default).
   [-c]         = Discard non-clipped sequences (i.e. - keep only sequences which contained the adapter).
   [-C]         = Discard clipped sequences (i.e. - keep only sequences which did not contained the adapter).
   [-k]         = Report Adapter-Only sequences.
   [-n]         = keep sequences with unknown (N) nucleotides. default is to discard such sequences.
   [-v]         = Verbose - report number of sequences.
          If [-o] is specified,  report will be printed to STDOUT.
          If [-o] is not specified (and output goes to STDOUT),
          report will be printed to STDERR.
   [-z]         = Compress output with GZIP.
   [-D]     = DEBUG output.
   [-i INFILE]  = FASTA/Q input file. default is STDIN.
   [-o OUTFILE] = FASTA/Q output file. default is STDOUT.
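The effect of -a, -l and -n can be sketched in Python as follows. This is deliberately simpler than the real tool: it looks for an exact occurrence of the adapter with `str.find`, whereas fastx_clipper may tolerate mismatches or partial adapters, and checking for N after clipping is an assumption. The function name `clip_adapter` is hypothetical.

```python
from typing import Optional


def clip_adapter(sequence: str, adapter: str, min_length: int = 5,
                 keep_n: bool = False) -> Optional[str]:
    """Return the read with the adapter and everything after it removed,
    or None if the read would be discarded."""
    position = sequence.find(adapter)  # exact match only (simplification)
    clipped = sequence if position < 0 else sequence[:position]
    if len(clipped) < min_length:
        return None  # too short, which also covers adapter-only reads
    if not keep_n and "N" in clipped:
        return None  # unknown nucleotides are discarded by default
    return clipped


print(clip_adapter("ACGTACGTAACTGTAGG", "CTGTAGG"))
```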

FASTA/Q Renamer

usage: fastx_renamer [-n TYPE] [-h] [-z] [-v] [-i INFILE] [-o OUTFILE]
Part of FASTX Toolkit 0.0.10 by A. Gordon (gordon@cshl.edu)

   [-n TYPE]    = rename type:
          SEQ - use the nucleotides sequence as the name.
          COUNT - use simply counter as the name.
   [-h]         = This helpful help screen.
   [-z]         = Compress output with GZIP.
   [-i INFILE]  = FASTA/Q input file. default is STDIN.
   [-o OUTFILE] = FASTA/Q output file. default is STDOUT.

FASTA/Q Trimmer

usage: fastx_trimmer [-h] [-f N] [-l N] [-z] [-v] [-i INFILE] [-o OUTFILE]

version 0.0.6
   [-h]         = This helpful help screen.
   [-f N]       = First base to keep. Default is 1 (=first base).
   [-l N]       = Last base to keep. Default is entire read.
   [-z]         = Compress output with GZIP.
   [-i INFILE]  = FASTA/Q input file. default is STDIN.
   [-o OUTFILE] = FASTA/Q output file. default is STDOUT.
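The -f and -l positions are 1-based and inclusive, which maps onto a Python slice as sketched here. The helper name `trim` is hypothetical; for FASTQ input the same coordinates would be applied to the quality string.

```python
from typing import Optional


def trim(sequence: str, first: int = 1, last: Optional[int] = None) -> str:
    """Keep bases first..last (1-based, inclusive); last=None keeps the rest."""
    if first < 1:
        raise ValueError("first must be >= 1")
    return sequence[first - 1:last]


print(trim("ACGTACGT", first=2, last=5))  # CGTA
```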

FASTA/Q Collapser

usage: fastx_collapser [-h] [-v] [-i INFILE] [-o OUTFILE]

version 0.0.6
   [-h]         = This helpful help screen.
   [-v]         = verbose: print short summary of input/output counts
   [-i INFILE]  = FASTA/Q input file. default is STDIN.
   [-o OUTFILE] = FASTA/Q output file. default is STDOUT.
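Collapsing amounts to counting identical sequences, which the following Python sketch does with `collections.Counter`. The function name `collapse` is hypothetical, and the "rank-count" identifier format used for the output names is an assumption about how the collapsed entries are labeled, not something stated in the help text above.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def collapse(sequences: Iterable[str]) -> List[Tuple[str, str]]:
    """Return (identifier, sequence) pairs, most abundant sequence first."""
    counts = Counter(sequences)
    # Sort by descending count, then alphabetically for a stable order.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(f"{rank}-{count}", seq)
            for rank, (seq, count) in enumerate(ranked, start=1)]


for name, seq in collapse(["ACGT", "TTTT", "ACGT", "GGGG", "ACGT", "TTTT"]):
    print(f">{name}\n{seq}")
```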

FASTQ/A Artifacts Filter

usage: fastq_artifacts_filter [-h] [-v] [-z] [-i INFILE] [-o OUTFILE]

version 0.0.6
   [-h]         = This helpful help screen.
   [-i INFILE]  = FASTA/Q input file. default is STDIN.
   [-o OUTFILE] = FASTA/Q output file. default is STDOUT.
   [-z]         = Compress output with GZIP.
   [-v]         = Verbose - report number of processed reads.
          If [-o] is specified,  report will be printed to STDOUT.
          If [-o] is not specified (and output goes to STDOUT),
          report will be printed to STDERR.

FASTQ Quality Filter

usage: fastq_quality_filter [-h] [-v] [-q N] [-p N] [-z] [-i INFILE] [-o OUTFILE]

version 0.0.6
   [-h]         = This helpful help screen.
   [-q N]       = Minimum quality score to keep.
   [-p N]       = Minimum percent of bases that must have [-q] quality.
   [-z]         = Compress output with GZIP.
   [-i INFILE]  = FASTA/Q input file. default is STDIN.
   [-o OUTFILE] = FASTA/Q output file. default is STDOUT.
   [-v]         = Verbose - report number of sequences.
          If [-o] is specified,  report will be printed to STDOUT.
          If [-o] is not specified (and output goes to STDOUT),
          report will be printed to STDERR.
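How -q and -p combine can be sketched in Python: a read is kept when at least p percent of its bases have a quality of q or higher. The function name `passes_quality_filter` is hypothetical, and the default offset of 33 is chosen for the example only (the tool itself assumes Phred+64 unless told otherwise).

```python
def passes_quality_filter(quality: str, min_quality: int, min_percent: float,
                          offset: int = 33) -> bool:
    """True if at least min_percent of bases score >= min_quality."""
    scores = [ord(ch) - offset for ch in quality]
    good = sum(1 for score in scores if score >= min_quality)
    # Compare without division so an exact threshold (e.g. 80%) is inclusive.
    return good * 100 >= min_percent * len(scores)


# Four bases at Q40 and one at Q2: 80% of bases reach Q20.
print(passes_quality_filter("IIII#", min_quality=20, min_percent=80))
```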

FASTQ/A Reverse Complement

usage: fastx_reverse_complement [-h] [-r] [-z] [-v] [-i INFILE] [-o OUTFILE]

version 0.0.6
   [-h]         = This helpful help screen.
   [-z]         = Compress output with GZIP.
   [-i INFILE]  = FASTA/Q input file. default is STDIN.
   [-o OUTFILE] = FASTA/Q output file. default is STDOUT.
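For reference, reverse-complementing a read means complementing each base and reversing the order; for FASTQ input the quality string has to be reversed as well so that each score stays with its base. A minimal Python sketch (the function name `reverse_complement` is hypothetical, and only A, C, G, T and N are handled):

```python
from typing import Optional, Tuple

_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(sequence: str,
                       quality: Optional[str] = None) -> Tuple[str, Optional[str]]:
    """Return the reverse complement and, if given, the reversed quality string."""
    rc = sequence.translate(_COMPLEMENT)[::-1]
    return rc, (quality[::-1] if quality is not None else None)


print(reverse_complement("AACGN", "ABCDE"))
```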

FASTA Formatter

usage: fasta_formatter [-h] [-i INFILE] [-o OUTFILE] [-w N] [-t] [-e]
Part of FASTX Toolkit 0.0.7 by gordon@cshl.edu

   [-h]         = This helpful help screen.
   [-i INFILE]  = FASTA/Q input file. default is STDIN.
   [-o OUTFILE] = FASTA/Q output file. default is STDOUT.
   [-w N]       = max. sequence line width for output FASTA file.
          When ZERO (the default), sequence lines will NOT be wrapped -
          all nucleotides of each sequences will appear on a single 
          line (good for scripting).
   [-t]         = Output tabulated format (instead of FASTA format).
          Sequence-Identifiers will be on first column,
          Nucleotides will appear on second column (as single line).
   [-e]         = Output empty sequences (default is to discard them).
          Empty sequences are ones who have only a sequence identifier,
          but not actual nucleotides.
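The -w and -t behaviors described above can be mimicked in a few lines of Python. This is a sketch for a single sequence only; the function name `format_fasta` is hypothetical.

```python
def format_fasta(name: str, sequence: str, width: int = 0,
                 tabular: bool = False) -> str:
    """Format one sequence; width=0 means no wrapping, tabular gives name<TAB>seq."""
    if tabular:
        return f"{name}\t{sequence}"
    if width <= 0:
        return f">{name}\n{sequence}"
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([f">{name}"] + lines)


print(format_fasta("s1", "ACGTACGTAC", width=4))
```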

Example: FASTQ Information

$ fastx_quality_stats -i BC54.fq -o bc54_stats.txt
$ fastq_quality_boxplot_graph.sh -i bc54_stats.txt -o bc54_quality.png -t "My Library"
$ fastx_nucleotide_distribution_graph.sh -i bc54_stats.txt -o bc54_nuc.png -t "My Library"

[Figure: bc54_quality.png, quality score boxplot]

[Figure: bc54_nuc.png, nucleotide distribution]

Example: FASTQ/A Manipulation

Common pre-processing work-flow:

  1. Converting FASTQ to FASTA
  2. Clipping the Adapter/Linker
  3. Trimming to 27nt (if you're analyzing miRNAs, for example)
  4. Collapsing the sequences
  5. Plotting the clipping results

Using the FASTX-toolkit from the command line:

  • fastq_to_fasta -v -n -i BC54.fq -o BC54.fa
    Input: 100000 reads.
    Output: 100000 reads.

  • fastx_clipper -v -i BC54.fa -a CTGTAGGCACCATCAATTCGTA -o BC54.clipped.fa
    Clipping Adapter: CTGTAGGCACCATCAATTCGTA
    Min. Length: 15
    Input: 100000 reads.
    Output: 92533 reads.
    discarded 468 too-short reads.
    discarded 6939 adapter-only reads.
    discarded 60 N reads.

  • fastx_trimmer -v -f 1 -l 27 -i BC54.clipped.fa -o BC54.trimmed.fa
    Trimming: base 1 to 27
    Input: 92533 reads.
    Output: 92533 reads.

  • fastx_collapser -v -i BC54.trimmed.fa -o BC54.collapsed.fa
    Collapsed 92533 reads into 36431 unique sequences.

  • fasta_clipping_histogram.pl BC54.collapsed.fa bc54_clipping.png

These steps are usually written as a single pipeline in a shell script:

cat BC54.fq | fastq_to_fasta -n | fastx_clipper -l 15 -a CTGTAGGCACCATCAATTCGTA | fastx_trimmer -f 1 -l 27 | fastx_collapser > bc54.final.fa
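To show how the clip, trim and collapse stages of this pipeline feed into each other, here is an in-memory Python sketch that applies the same parameters (adapter CTGTAGGCACCATCAATTCGTA, minimum length 15, trim to 27 nt). It is only an approximation of the idea: adapter matching is exact, discarding N-containing reads after clipping is an assumption, the "rank-count" names are an assumed output format, and the function name `preprocess` is hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Tuple

ADAPTER = "CTGTAGGCACCATCAATTCGTA"


def preprocess(sequences: Iterable[str], adapter: str = ADAPTER,
               min_length: int = 15, last: int = 27) -> List[Tuple[str, str]]:
    """Clip the adapter, drop short or N-containing reads, trim, then collapse."""
    kept = []
    for seq in sequences:
        pos = seq.find(adapter)                    # exact match (simplification)
        clipped = seq if pos < 0 else seq[:pos]
        if len(clipped) < min_length or "N" in clipped:
            continue
        kept.append(clipped[:last])                # keep bases 1..last
    ranked = sorted(Counter(kept).items(), key=lambda kv: (-kv[1], kv[0]))
    return [(f"{i}-{n}", s) for i, (s, n) in enumerate(ranked, start=1)]


insert = "ACGTACGTACGTACGTAC"  # made-up 18-nt read for demonstration
reads = [insert + ADAPTER, insert + ADAPTER, "ACGT" + ADAPTER, "A" * 30]
result = preprocess(reads)
print(result)
```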

[Figure: bc54_clipping.png, clipping histogram]

References

fastx_toolkit/commandline

https://www.cnblogs.com/zkkaka/p/6146293.html
