通過前期的基因定量壤蚜,我們通過RSEM或者Kallisto這些工具得到了每個(gè)基因或者轉(zhuǎn)錄本的表達(dá)量矩陣蘑拯。輸出的矩陣包含了基因以及在各個(gè)樣本中的表達(dá)量信息,這些信息將為我們后期判定基因表達(dá)量變化趨勢(shì)以及變化的顯著性提供原始參考。
Trinity工具包同樣也提供了一系列的腳本用以處理這些基因表達(dá)量變化差異顯著性分析,在Trinity中主要使用的是EdgeR包進(jìn)行處理抛寝。
在此之前我們需要注意一點(diǎn):對(duì)于基因表達(dá)量分析由于是要對(duì)比不同樣本,因此需要每個(gè)基因在不同樣本中的表達(dá)量均有記錄(哪怕是無表達(dá)量的0也可)曙旭。因此盗舰,Trinity建議在計(jì)算基因表達(dá)量時(shí)采用的參考轉(zhuǎn)錄本庫應(yīng)該是一個(gè)庫,即先通過所有樣本原始數(shù)據(jù)進(jìn)行拼接而得到統(tǒng)一的轉(zhuǎn)錄本庫桂躏,進(jìn)而依照該轉(zhuǎn)錄本庫進(jìn)行回帖計(jì)算表達(dá)量钻趋。
在這個(gè)工作之前需要兩個(gè)數(shù)據(jù):
1. 基因表達(dá)的counts.matrix 文件
2. 生物學(xué)重復(fù)的表文件(由于后期比較的時(shí)候采用的配對(duì)顯著性驗(yàn)證,因此會(huì)將每個(gè)樣本之間進(jìn)行比較沼头,而存在生物學(xué)重復(fù)的類型需要通過該文件指明生物學(xué)重復(fù)情況)
此外爷绘,由于Trinity調(diào)用了多個(gè)R包,因此需要在進(jìn)行運(yùn)算前安裝這些R包
% R
> source("http://bioconductor.org/biocLite.R") #由于包都是Bioconductor上的进倍,因此需要安裝這個(gè)東西
> biocLite('edgeR')
> biocLite('limma')
> biocLite('DESeq2')
> biocLite('ctc')
> biocLite('Biobase')
> install.packages('gplots')
> install.packages('ape')
接下來獲取腳本幫助信息
yeyuntian@yeyuntian-RESCUER-R720-15IKBN:~/Biodata/trinitytest/downstr/RSEMout/RSEMout$ $TRINITY_HOME/Analysis/DifferentialExpression/run_DE_analysis.pl -h
#################################################################################################
#
# Required:
#
# --matrix|m <string> matrix of raw read counts (not normalized!)
#
# --method <string> edgeR|DESeq2|voom|ROTS
# note: you should have biological replicates.
# edgeR will support having no bio replicates with
# a fixed dispersion setting.
#
# Optional:
#
# --samples_file|s <string> tab-delimited text file indicating biological replicate relationships.
# ex.
# cond_A cond_A_rep1
# cond_A cond_A_rep2
# cond_B cond_B_rep1
# cond_B cond_B_rep2
#
#
# General options:
#
# --min_reps_min_cpm <string> default: 2,1 (format: 'min_reps,min_cpm')
# At least min count of replicates must have cpm values > min cpm value.
# (ie. filtMatrix = matrix[rowSums(cpm(matrix)> min_cpm) >= min_reps, ] adapted from edgeR manual)
# Note, ** if no --samples_file, default for min_reps is set = 1 **
#
# --output|o name of directory to place outputs (default: $method.$pid.dir)
#
# --reference_sample <string> name of a sample to which all other samples should be compared.
# (default is doing all pairwise-comparisons among samples)
#
# --contrasts <string> file (tab-delimited) containing the pairs of sample comparisons to perform.
# ex.
# cond_A cond_B
# cond_Y cond_Z
#
#
###############################################################################################
#
# ## EdgeR-related parameters
# ## (no biological replicates)
#
# --dispersion <float> edgeR dispersion value (Read edgeR manual to guide your value choice)
# http://www.bioconductor.org/packages/release/bioc/html/edgeR.html
# ## ROTS parameters
# --ROTS_B <int> : number of bootstraps and permutation resampling (default: 500)
# --ROTS_K <int> : largest top genes size (default: 5000)
#
#
###############################################################################################
#
# Documentation and manuals for various DE methods. Please read for more advanced and more
# fine-tuned DE analysis than provided by this helper script.
#
# edgeR: http://www.bioconductor.org/packages/release/bioc/html/edgeR.html
# DESeq2: http://bioconductor.org/packages/release/bioc/html/DESeq2.html
# voom/limma: http://bioconductor.org/packages/release/bioc/html/limma.html
# ROTS: http://www.btk.fi/research/research-groups/elo/software/rots/
#
###############################################################################################
可以看到Trinity提供的這個(gè)腳本是可以支持edgeR、DESeq2购对、voom/limma以及ROTS的猾昆,也就是說后期需要指定使用哪個(gè)軟件進(jìn)行分析。
EdgeR軟件可以支持沒有生物學(xué)重復(fù)的實(shí)驗(yàn)數(shù)據(jù)骡苞。通過指定dispersion參數(shù)進(jìn)行設(shè)定垂蜗,該參數(shù)很重要需要參考edgeR的使用說明進(jìn)行設(shè)定。
我們的數(shù)據(jù)是有生物學(xué)重復(fù)的因此采用以下腳本進(jìn)行:
yeyuntian@yeyuntian-RESCUER-R720-15IKBN:~/Biodata/trinitytest/downstr/RSEMout/RSEMout$ cat samples.txt
B25 B251
B25 B252
R25 R251
R25 R252
W25 W251
W25 W252
#查看樣品信息三個(gè)處理兩個(gè)生物學(xué)重復(fù)
yeyuntian@yeyuntian-RESCUER-R720-15IKBN:~/Biodata/trinitytest/downstr/RSEMout/RSEMout/DEEdgeRout$ $TRINITY_HOME/Analysis/DifferentialExpression/run_DE_analysis.pl \#調(diào)用腳本
--matrix RSEM.isoform.counts.matrix \#采用的表達(dá)矩陣
--method edgeR \#采用的比對(duì)方式
--samples_file samples.txt \#
--output DEEdgeRout/ #結(jié)果輸出到一個(gè)單獨(dú)的文件夾
運(yùn)行完了過后我們就去看看結(jié)果
yeyuntian@yeyuntian-RESCUER-R720-15IKBN:~/Biodata/trinitytest/downstr/RSEMout/RSEMout$ cd DEEdgeRout/yeyuntian@yeyuntian-RESCUER-R720-15IKBN:~/Biodata/trinitytest/downstr/RSEMout/RSEMout/DEEdgeRout$
yeyuntian@yeyuntian-RESCUER-R720-15IKBN:~/Biodata/trinitytest/downstr/RSEMout/RSEMout/DEEdgeRout$ l
RSEM.isoform.counts.matrix.B25_vs_R25.B25.vs.R25.EdgeR.Rscript
RSEM.isoform.counts.matrix.B25_vs_R25.edgeR.count_matrix
RSEM.isoform.counts.matrix.B25_vs_R25.edgeR.DE_results
RSEM.isoform.counts.matrix.B25_vs_R25.edgeR.DE_results.MA_n_Volcano.pdf
RSEM.isoform.counts.matrix.B25_vs_W25.B25.vs.W25.EdgeR.Rscript
RSEM.isoform.counts.matrix.B25_vs_W25.edgeR.count_matrix
RSEM.isoform.counts.matrix.B25_vs_W25.edgeR.DE_results
RSEM.isoform.counts.matrix.B25_vs_W25.edgeR.DE_results.MA_n_Volcano.pdf
RSEM.isoform.counts.matrix.R25_vs_W25.edgeR.count_matrix
RSEM.isoform.counts.matrix.R25_vs_W25.edgeR.DE_results
RSEM.isoform.counts.matrix.R25_vs_W25.edgeR.DE_results.MA_n_Volcano.pdf
RSEM.isoform.counts.matrix.R25_vs_W25.R25.vs.W25.EdgeR.Rscript
12個(gè)文件共4類 包括
- Rscript(包括分析的腳本解幽,通過這個(gè)腳本的修改可以調(diào)整參數(shù)和輸出的圖片)
- cout_matrix (提取出來的兩兩比對(duì)的數(shù)據(jù))
- DE_results (這個(gè)是比對(duì)結(jié)果)
- PDF (可視化的圖片)
yeyuntian@yeyuntian-RESCUER-R720-15IKBN:~/Biodata/trinitytest/downstr/RSEMout/RSEMout/DEEdgeRout$ head RSEM.isoform.counts.matrix.B25_vs_R25.edgeR.DE_results
sampleA sampleB logFC logCPM PValue FDR
TRINITY_DN4700_c2_g1_i7 B25 R25 10.6165324802447 7.00741992505597 1.40250971832104e-141 7.67551493545557e-137
TRINITY_DN494_c0_g1_i9 B25 R25 11.3576752267624 6.27009842572613 5.23345757616711e-124 1.43205716385449e-119
TRINITY_DN26881_c0_g1_i3 B25 R25 14.7687111138408 6.44776948986216 1.07843333864274e-123 1.96731404413004e-119
TRINITY_DN5992_c0_g1_i13 B25 R25 14.5089667939309 6.1883806816389 6.15524143204199e-119 8.42144744628405e-115
TRINITY_DN1935_c0_g1_i1 B25 R25 -14.0852144942388 5.76481253295512 2.09305509819126e-103 2.29093252717426e-99
TRINITY_DN7856_c0_g1_i1 B25 R25 9.07630178998634 5.87159238635241 2.59525133714949e-102 2.36717199880301e-98
TRINITY_DN9161_c0_g1_i4 B25 R25 14.0055230986592 5.6858177658871 7.69816890064996e-1016.01853842036958e-97
TRINITY_DN41_c0_g1_i1 B25 R25 10.2304801828419 6.36856943794737 7.12383309140778e-99 4.87332516991842e-95
TRINITY_DN6913_c0_g1_i10 B25 R25 -13.8141572932327 5.49418010251322 1.17661182106672e-89 7.15471501461318e-86
關(guān)于FDR的含義與解釋參考 http://www.reibang.com/p/26511d3360c8
MAplot.png
火山圖.png
關(guān)于這個(gè)圖的解釋贴见,參考 http://www.reibang.com/p/26511d3360c8
接下來對(duì)差異數(shù)據(jù)進(jìn)行注釋和處理
首先繪制差異基因的熱圖以及樣本分布
獲取處理腳本的幫助文件
yeyuntian@yeyuntian-RESCUER-R720-15IKBN:~/Biodata/trinitytest/downstr/RSEMout/RSEMout/DEEdgeRout$ $TRINITY_HOME/Analysis/DifferentialExpression/analyze_diff_expr.pl --help
####################################################################################
#
# Required:
#
# --matrix|m <string> TMM.EXPR.matrix
#
# Optional:
#
# -P <float> p-value cutoff for FDR (default: 0.001)
#
# -C <float> min abs(log2(a/b)) fold change (default: 2 (meaning 2^(2) or 4-fold).
#
# --output <float> prefix for output file (default: "diffExpr.P${Pvalue}_C${C})
#
#
#
#
# Misc:
#
# --samples|s <string> sample-to-replicate mappings (provided to run_DE_analysis.pl)
#
# --max_DE_genes_per_comparison <int> extract only up to the top number of DE features within each pairwise comparison.
# This is useful when you have massive numbers of DE features but still want to make
# useful heatmaps and other plots with more manageable numbers of data points.
#
# --order_columns_by_samples_file instead of clustering samples or replicates hierarchically based on gene expression patterns,
# order columns according to order in the --samples file.
#
# --max_genes_clust <int> default: 10000 (if more than that, heatmaps are not generated, since too time consuming)
#
# --examine_GO_enrichment run GO enrichment analysis
# --GO_annots <string> GO annotations file
# --gene_lengths <string> lengths of genes file
#
# --include_GOplot optional: will generate inputs to GOplot and attempt to make a preliminary pdf plot/report for it.
#
##############################################################
運(yùn)行這個(gè)腳本進(jìn)行計(jì)算后的得到結(jié)果
yeyuntian@yeyuntian-RESCUER-R720-15IKBN:~/Biodata/trinitytest/downstr/RSEMout/RSEMout/DEEdgeRout$ l -alt
total 76916
-rw-rw-r-- 1 yeyuntian yeyuntian 52233286 1月 29 17:21 diffExpr.P0.001_C2.matrix.RData
drwxrwxr-x 2 yeyuntian yeyuntian 4096 1月 29 17:21 ./
-rw-rw-r-- 1 yeyuntian yeyuntian 206154 1月 29 17:21 diffExpr.P0.001_C2.matrix.log2.centered.genes_vs_samples_heatmap.pdf
-rw-rw-r-- 1 yeyuntian yeyuntian 6462 1月 29 17:21 diffExpr.P0.001_C2.matrix.log2.centered.sample_cor_matrix.pdf
-rw-rw-r-- 1 yeyuntian yeyuntian 626 1月 29 17:21 diffExpr.P0.001_C2.matrix.log2.centered.sample_cor.dat
-rw-rw-r-- 1 yeyuntian yeyuntian 490031 1月 29 17:21 diffExpr.P0.001_C2.matrix.log2.centered.dat
-rw-rw-r-- 1 yeyuntian yeyuntian 4611 1月 29 17:21 diffExpr.P0.001_C2.matrix.R
-rw-rw-r-- 1 yeyuntian yeyuntian 224690 1月 29 17:21 diffExpr.P0.001_C2.matrix
-rw-rw-r-- 1 yeyuntian yeyuntian 59 1月 29 17:21 DE_feature_counts.P0.001_C2.matrix
-rw-rw-r-- 1 yeyuntian yeyuntian 58609 1月 29 17:21 RSEM.isoform.counts.matrix.R25_vs_W25.edgeR.DE_results.P0.001_C2.DE.subset
-rw-rw-r-- 1 yeyuntian yeyuntian 23967 1月 29 17:21 RSEM.isoform.counts.matrix.R25_vs_W25.edgeR.DE_results.P0.001_C2.R25-UP.subset
-rw-rw-r-- 1 yeyuntian yeyuntian 34712 1月 29 17:21 RSEM.isoform.counts.matrix.R25_vs_W25.edgeR.DE_results.P0.001_C2.W25-UP.subset
-rw-rw-r-- 1 yeyuntian yeyuntian 239536 1月 29 17:21 RSEM.isoform.counts.matrix.B25_vs_W25.edgeR.DE_results.P0.001_C2.B25-UP.subset
-rw-rw-r-- 1 yeyuntian yeyuntian 400145 1月 29 17:21 RSEM.isoform.counts.matrix.B25_vs_W25.edgeR.DE_results.P0.001_C2.DE.subset
-rw-rw-r-- 1 yeyuntian yeyuntian 160679 1月 29 17:21 RSEM.isoform.counts.matrix.B25_vs_W25.edgeR.DE_results.P0.001_C2.W25-UP.subset
-rw-rw-r-- 1 yeyuntian yeyuntian 36 1月 29 17:21 RSEM.isoform.counts.matrix.R25_vs_W25.edgeR.DE_results.samples
-rw-rw-r-- 1 yeyuntian yeyuntian 36 1月 29 17:21 RSEM.isoform.counts.matrix.B25_vs_W25.edgeR.DE_results.samples
-rw-rw-r-- 1 yeyuntian yeyuntian 219363 1月 29 17:21 RSEM.isoform.counts.matrix.B25_vs_R25.edgeR.DE_results.P0.001_C2.B25-UP.subset
-rw-rw-r-- 1 yeyuntian yeyuntian 317428 1月 29 17:21 RSEM.isoform.counts.matrix.B25_vs_R25.edgeR.DE_results.P0.001_C2.DE.subset
-rw-rw-r-- 1 yeyuntian yeyuntian 98135 1月 29 17:21 RSEM.isoform.counts.matrix.B25_vs_R25.edgeR.DE_results.P0.001_C2.R25-UP.subset
-rw-rw-r-- 1 yeyuntian yeyuntian 36 1月 29 17:21 RSEM.isoform.counts.matrix.B25_vs_R25.edgeR.DE_results.samples
在這些結(jié)果包括
- 兩兩比較的樣本信息(.samples)
- 樣本比較后的基因表達(dá)情況上調(diào)還是下調(diào)的文件以及整合在一起的基因(.subset)
- 差異基因在不同樣本中分布數(shù)量(DE_feature_counts.P0.001_C2.matrix)
- 差異基因表達(dá)量矩陣(diffExpr.P0.001_C2.matrix)
- 差異基因?qū)?shù)轉(zhuǎn)換后的表達(dá)矩陣 (diffExpr.P0.001_C2.matrix.log2.centered.dat)
-
出的兩個(gè)圖(PDF)
image.png
image.png - 包括一個(gè)R腳本和一個(gè)Rdata(這兩個(gè)文件可以通過修改和調(diào)用實(shí)現(xiàn)對(duì)后期出圖的控制)
最后要提示一點(diǎn)就是注意對(duì)于R腳本修改后可以通過shell在所在文件夾進(jìn)行直接運(yùn)行后得到對(duì)應(yīng)的圖片
yeyuntian@yeyuntian-RESCUER-R720-15IKBN:~/Biodata/trinitytest/downstr/RSEMout/RSEMout/DEEdgeRout$ Rscript diffExpr.P0.001_C2.matrix.R