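The Cufflinks suite consists of several main programs, including cufflinks, cuffmerge, cuffcompare, and cuffdiff. It is used mainly for transcriptome assembly and differential expression analysis.
I. Introduction to Cuffdiff
Cuffdiff is used to find significant differences in transcript expression.
II. Using Cuffdiff
(Assumes a Linux environment; installation steps are omitted here.)
cuffdiff finds significant changes in transcript expression, splicing, and promoter use.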
Usage: cuffdiff [options] <transcripts.gtf> <sample1_hits.sam> <sample2_hits.sam> [... sampleN_hits.sam]
Supply replicate SAMs as comma separated lists for each condition: sample1_rep1.sam,sample1_rep2.sam,...sample1_repM.sam
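Here transcripts.gtf is a file produced by cufflinks, cuffcompare, or cuffmerge, or by another program. When a sample has multiple replicates, separate them with commas. When more than one sample is given, cuffdiff compares gene expression between the samples.
cuffdiff accepts BAM/SAM files or CXB files produced by cuffquant. It can take a mix of BAM and SAM files, but not a mix of BAM/SAM and CXB files.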
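The calling convention above (one argument per condition, replicates joined by commas, no mixing of CXB with BAM/SAM) can be sketched as a small command builder. This is a minimal illustration, not part of Cufflinks: the function name `build_cuffdiff_command` and all file names are hypothetical, and the function only assembles an argument list without running anything.

```python
from typing import Dict, List


def build_cuffdiff_command(
    gtf: str,
    conditions: Dict[str, List[str]],
    output_dir: str = "./",
    threads: int = 1,
) -> List[str]:
    """Assemble a cuffdiff argument list (hypothetical helper; runs nothing)."""
    if any(not reps for reps in conditions.values()):
        raise ValueError("every condition needs at least one input file")

    # The text above says CXB files cannot be mixed with BAM/SAM files.
    all_files = [f for reps in conditions.values() for f in reps]
    is_cxb = [f.lower().endswith(".cxb") for f in all_files]
    if any(is_cxb) and not all(is_cxb):
        raise ValueError("CXB files cannot be mixed with BAM/SAM files")

    cmd = ["cuffdiff", "-o", output_dir, "-p", str(threads)]
    # -L takes one comma-separated label per condition, in input order.
    cmd += ["-L", ",".join(conditions)]
    cmd.append(gtf)
    # Replicates of one condition are joined with commas into one argument.
    cmd.extend(",".join(reps) for reps in conditions.values())
    return cmd


# Hypothetical file names, for illustration only.
example = build_cuffdiff_command(
    "merged.gtf",
    {"P1": ["p1_rep1.bam", "p1_rep2.bam"], "P2": ["p2_rep1.sam"]},
    output_dir="diff_out",
    threads=4,
)
print(" ".join(example))
```

The resulting list can be passed to `subprocess.run` once cuffdiff is installed; keeping it as a list avoids shell quoting problems with unusual file names.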
III. Cuffdiff Options
General Options:
-o/--output-dir write all output files to this directory [ default: ./ ] # output directory
-L/--labels comma-separated list of condition labels # a label for each sample or condition
--FDR False discovery rate used in testing [ default: 0.05 ] # allowed false discovery rate
-M/--mask-file ignore all alignment within transcripts in this file [ default: NULL ] # supply a GTF/GFF file; reads that align to transcripts in this file are ignored. The file usually holds rRNA annotations and may also include mitochondrial or other transcripts you want to ignore. Removing this unwanted RNA helps when estimating mRNA expression.
-C/--contrast-file Perform the contrasts specified in this file [ default: NULL ] # run the comparisons listed in this file
-b/--frag-bias-correct use bias correction - reference fasta required [ default: NULL ] # supply a FASTA file so that Cufflinks runs its bias detection and correction algorithm; this can noticeably improve the accuracy of transcript abundance estimates.
-u/--multi-read-correct use 'rescue method' for multi-reads [ default: FALSE ] # have Cufflinks run an initial estimation step so that reads mapping to multiple genomic locations are weighted more accurately.
-p/--num-threads number of threads used during quantification [ default: 1 ] # number of CPU threads to use
--no-diff Don't generate differential analysis files [ default: FALSE ] # do not generate differential analysis files
--no-js-tests Don't perform isoform switching tests [ default: FALSE ] # do not perform isoform switching tests
-T/--time-series treat samples as a time-series [ default: FALSE ] # compare samples in the order given instead of all pairwise: the second SAM against the first, the third against the second, the fourth against the third, and so on.
--library-type Library prep used for input reads [ default: below ] # library type of the input reads, for example strand-specific reads, in which case the alignments carry an XS tag. Standard Illumina data is usually fr-unstranded.
--dispersion-method Method used to estimate dispersion models [ default: below ] # method used to estimate dispersion models
--library-norm-method Method used to normalize library sizes [ default: below ] # method used to normalize library sizes
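To make the effect of -T/--time-series concrete, the sketch below lists which pairs of conditions would be compared with and without it, following the description above. The function name `planned_comparisons` and the labels are hypothetical; this only illustrates the pairing rule and does not call cuffdiff.

```python
from itertools import combinations
from typing import List, Tuple


def planned_comparisons(
    labels: List[str], time_series: bool = False
) -> List[Tuple[str, str]]:
    """Return the condition pairs implied by the pairing rule described above."""
    if time_series:
        # Each condition is compared only with the one that follows it.
        return list(zip(labels, labels[1:]))
    # Default: every pair of conditions, in input order.
    return list(combinations(labels, 2))


labels = ["T0", "T1", "T2", "T3"]  # hypothetical condition labels
print(planned_comparisons(labels))
print(planned_comparisons(labels, time_series=True))
```

With four conditions, the default mode yields six comparisons while time-series mode yields three, which also shortens the differential test files.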
IV. Cuffdiff Output
1. FPKM tracking files
cuffdiff calculates the FPKM of each transcript, primary transcript, and gene in each sample. Gene and primary transcript FPKMs are the sums of the FPKMs of the transcripts in each gene group or primary transcript group.
isoforms.fpkm_tracking Transcript FPKMs
genes.fpkm_tracking Gene FPKMs. Tracks the summed FPKM of transcripts sharing each gene_id
cds.fpkm_tracking Coding sequence FPKMs. Tracks the summed FPKM of transcripts sharing each p_id, independent of tss_id
tss_groups.fpkm_tracking Primary transcript FPKMs. Tracks the summed FPKM of transcripts sharing each tss_id
File format:
tracking_id class_code nearest_ref_id gene_id gene_short_name tss_id locus length coverage P1_FPKM P1_conf_lo P1_conf_hi P1_status P2_FPKM P2_conf_lo P2_conf_hi P2_status P3_FPKM P3_conf_lo P3_conf_hi P3_status
ENST00000000233 - - ENSG00000004059 ARF5 - 7:127220671-127242198 1103 - 58.3768 0 139.888 OK 47.3478 0 113.046 OK 78.9705 0 184.419 OK
Note: P1, P2, and P3 are sample names.
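The relationship between transcript-level and gene-level FPKMs can be illustrated by summing isoform FPKMs per gene_id. The sketch assumes the tracking file is tab-delimited with the column names shown above; the function name `sum_fpkm_by_gene` is hypothetical, and the second isoform row is invented purely to give the sum something to add.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, Iterable


def sum_fpkm_by_gene(lines: Iterable[str], sample: str) -> Dict[str, float]:
    """Sum <sample>_FPKM over transcripts that share a gene_id."""
    totals: Dict[str, float] = defaultdict(float)
    for row in csv.DictReader(lines, delimiter="\t"):
        # Skip rows whose quantification did not finish with status OK.
        if row[f"{sample}_status"] != "OK":
            continue
        totals[row["gene_id"]] += float(row[f"{sample}_FPKM"])
    return dict(totals)


header = ["tracking_id", "gene_id", "gene_short_name",
          "P1_FPKM", "P1_conf_lo", "P1_conf_hi", "P1_status"]
rows = [
    # Values taken from the example row above (P1 columns only).
    ["ENST00000000233", "ENSG00000004059", "ARF5", "58.3768", "0", "139.888", "OK"],
    # Hypothetical second isoform, not from the original data.
    ["ENST_HYPOTHETICAL", "ENSG00000004059", "ARF5", "10.0", "0", "25.0", "OK"],
]
text = "\n".join("\t".join(r) for r in [header] + rows) + "\n"

print(sum_fpkm_by_gene(io.StringIO(text), "P1"))
```

For a real run, pass an open handle on isoforms.fpkm_tracking instead of the in-memory text; the values in genes.fpkm_tracking should then be comparable to these sums.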
2. Count tracking files
Estimates the number of fragments originating from each transcript, primary transcript, and gene in each sample. Primary transcript and gene counts are the sums of the counts of the transcripts in each primary transcript group or gene group.
isoforms.count_tracking Transcript counts
genes.count_tracking Gene counts. Tracks the summed counts of transcripts sharing each gene_id
cds.count_tracking Coding sequence counts. Tracks the summed counts of transcripts sharing each p_id, independent of tss_id
tss_groups.count_tracking Primary transcript counts. Tracks the summed counts of transcripts sharing each tss_id
File format:
tracking_id P1_count P1_count_variance P1_count_uncertainty_var P1_count_dispersion_var P1_status P2_count P2_count_variance P2_count_uncertainty_var P2_count_dispersion_var P2_status P3_count P3_count_variance P3_count_uncertainty_var P3_count_dispersion_var P3_status
ENST00000000233 1226.79 733396 0 591186 OK 992.56 474498 0 376391 OK 1661.82 1.22994e+06 0 1.22994e+06 OK
3. Read group tracking files
Reports the expression and fragment count of each transcript, primary transcript, and gene in each replicate.
isoforms.read_group_tracking Transcript read group tracking
genes.read_group_tracking Gene read group tracking. Tracks the summed expression and counts of transcripts sharing each gene_id in each replicate
cds.read_group_tracking Coding sequence FPKMs. Tracks the summed expression and counts of transcripts sharing each p_id, independent of tss_id in each replicate
tss_groups.read_group_tracking Primary transcript FPKMs. Tracks the summed expression and counts of transcripts sharing each tss_id in each replicate
File format:
tracking_id condition replicate raw_frags internal_scaled_frags external_scaled_frags FPKM effective_length status
ENST00000000233 MofRCC 1 1307.38 1182.81 1182.81 56.4898 - OK
4. Differential expression test files
Tests for differential expression between samples for spliced transcripts, primary transcripts, genes, and coding sequences. For each pair of samples x and y, the following four files are produced:
isoform_exp.diff Transcript differential FPKM.
gene_exp.diff Gene differential FPKM. Tests differences in the summed FPKM of transcripts sharing each gene_id
tss_group_exp.diff Primary transcript differential FPKM. Tests differences in the summed FPKM of transcripts sharing each tss_id
cds_exp.diff Coding sequence differential FPKM. Tests differences in the summed FPKM of transcripts sharing each p_id independent of tss_id
File format:
test_id gene_id gene locus sample_1 sample_2 status value_1 value_2 log2(fold_change) test_stat p_value q_value significant
ENST00000000233 ENSG00000004059 ARF5 7:127220671-127242198 MofRCC NofRCC OK 58.3768 47.3478 -0.302097 -0.212748 0.7584 0.992833 no
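As a quick check on how to read these columns, the sketch below recomputes log2(fold_change) from value_1 and value_2 and filters rows flagged as significant. It assumes the *.diff files are tab-delimited with the header shown above; the helper names are hypothetical. For the example row, log2(47.3478 / 58.3768) agrees with the reported -0.302097, so the fold change is sample_2 relative to sample_1.

```python
import csv
import io
import math
from typing import Dict, Iterable, List


def load_diff(lines: Iterable[str]) -> List[Dict[str, str]]:
    """Read a tab-delimited *.diff table into a list of dict rows."""
    return list(csv.DictReader(lines, delimiter="\t"))


def log2_fold_change(value_1: float, value_2: float) -> float:
    """log2 of sample_2 expression over sample_1 expression."""
    if value_1 <= 0 or value_2 <= 0:
        raise ValueError("both values must be positive for a finite log ratio")
    return math.log2(value_2 / value_1)


def significant_rows(rows: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep rows whose test finished (status OK) and is flagged 'yes'."""
    return [r for r in rows if r["status"] == "OK" and r["significant"] == "yes"]


header = ["test_id", "gene_id", "gene", "locus", "sample_1", "sample_2", "status",
          "value_1", "value_2", "log2(fold_change)", "test_stat", "p_value",
          "q_value", "significant"]
row = ["ENST00000000233", "ENSG00000004059", "ARF5", "7:127220671-127242198",
       "MofRCC", "NofRCC", "OK", "58.3768", "47.3478", "-0.302097",
       "-0.212748", "0.7584", "0.992833", "no"]
table = load_diff(io.StringIO("\t".join(header) + "\n" + "\t".join(row) + "\n"))

first = table[0]
print(log2_fold_change(float(first["value_1"]), float(first["value_2"])))
print(len(significant_rows(table)))
```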
5. Differential splicing tests – splicing.diff
For each primary transcript, reports the differences detected between samples among its isoforms. Only primary transcripts with two or more isoforms are listed.
File format:
test_id gene_id gene locus sample_1 sample_2 status value_1 value_2 sqrt(JS) test_stat p_value q_value significant
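The sqrt(JS) column is the square root of a Jensen-Shannon divergence between two relative abundance distributions. The sketch below shows that quantity for two sets of isoform FPKMs; it is a generic illustration under stated assumptions, not Cuffdiff's own code. The function names and input values are hypothetical, and the use of base-2 logarithms (which bounds the result between 0 and 1) is an assumption.

```python
import math
from typing import List, Sequence


def _normalize(values: Sequence[float]) -> List[float]:
    """Turn non-negative abundances into relative proportions."""
    total = sum(values)
    if total <= 0:
        raise ValueError("abundances must sum to a positive value")
    return [v / total for v in values]


def _entropy(p: Sequence[float]) -> float:
    """Shannon entropy in bits (base 2 is an assumption of this sketch)."""
    return -sum(x * math.log2(x) for x in p if x > 0)


def sqrt_js(a: Sequence[float], b: Sequence[float]) -> float:
    """Square root of the Jensen-Shannon divergence between two profiles."""
    if len(a) != len(b):
        raise ValueError("both profiles must list the same isoforms")
    p, q = _normalize(a), _normalize(b)
    m = [(x + y) / 2 for x, y in zip(p, q)]
    js = _entropy(m) - (_entropy(p) + _entropy(q)) / 2
    # Guard against tiny negative values caused by floating-point rounding.
    return math.sqrt(max(js, 0.0))


# Hypothetical isoform FPKMs for one primary transcript in two samples.
print(sqrt_js([30.0, 10.0, 5.0], [10.0, 30.0, 5.0]))
```

Identical isoform proportions give 0, and completely non-overlapping isoform usage gives 1 under the base-2 assumption, so larger values indicate a stronger shift in relative isoform abundance.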
6. Differential coding output – cds.diff
For each gene, reports the differences between samples in its CDS output. Only genes producing two or more distinct CDS (multi-protein genes) are listed in this file.
File format:
test_id gene_id gene locus sample_1 sample_2 status value_1 value_2 sqrt(JS) test_stat p_value q_value significant
7. Differential promoter use – promoters.diff
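Differences in promoter use between samples. Only genes producing two or more distinct primary transcripts (multi-promoter genes) are listed here.
File format:
test_id gene_id gene locus sample_1 sample_2 status value_1 value_2 sqrt(JS) test_stat p_value q_value significant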
8. Read group info – read_groups.info
Lists, for each replicate, the key properties that cuffdiff used during quantification.
File format:
file condition replicate_num total_mass norm_mass internal_scale external_scale
/PROJ/*/Quantification/P1/abundances.cxb MofRCC 0 2.8904e+07 2.44127e+07 1.20839 1
9. Run info – run.info
Information about the run.