接著前兩篇文章:
DESeq2詳解系列(1):http://www.reibang.com/p/88511070e2dd
DESeq2詳解系列(2):http://www.reibang.com/p/ffc16dacc711
探索末捣、輸出結(jié)果
MA-plot
每個點是一個gene肚逸,橫坐標是平均的標準化counts葵硕,縱坐標是logFC
plotMA(res, ylim=c(-2,2))
#對log2 fold changes做了shrink以后
plotMA(resLFC, ylim=c(-2,2))
#交互式地判斷每個點對應(yīng)什么基因
idx <- identify(res$baseMean, res$log2FoldChange)
rownames(res)[idx]
如果沒有shrink就是這樣痒给,低表達量的基因往往離散程度更大
做了矯正后:
其他的shrinkage estimators
有時候?qū)τ谝恍?shù)據(jù)集之前的shrink會過強,因此也可以考慮其他方法:
通過修改lfcShrink參數(shù)type來指定
The options for type are:
normal is the the original DESeq2 shrinkage estimator, an adaptive Normal distribution as prior. This is currently the default, although the default will likely change to apeglm in the October 2018 release given apeglm’s superior performance.
apeglm is the adaptive t prior shrinkage estimator from the apeglm package (Zhu, Ibrahim, and Love 2018).
ashr is the adaptive shrinkage estimator from the ashr package (Stephens 2016). Here DESeq2 uses the ashr option to fit a mixture of Normal distributions to form the prior, with method="shrinkage".
resultsNames(dds)
# because we are interested in treated vs untreated, we set 'coef=2'
resNorm <- lfcShrink(dds, coef=2, type="normal")
resAsh <- lfcShrink(dds, coef=2, type="ashr")
#比較三種shrink方法
par(mfrow=c(1,3), mar=c(4,4,2,1))#修改圖的展示方式
xlim <- c(1,1e5); ylim <- c(-3,3)
plotMA(resLFC, xlim=xlim, ylim=ylim, main="apeglm")
plotMA(resNorm, xlim=xlim, ylim=ylim, main="normal")
plotMA(resAsh, xlim=xlim, ylim=ylim, main="ashr")
如果需要校正批次效應(yīng)两入,在design里可以聲明好batch factor袱饭,或者使用其他的包
,比如sva或者RUVseq去捕捉那些潛在會產(chǎn)生異質(zhì)性數(shù)據(jù)的無關(guān)變量
關(guān)于sva是什么喇喉,可以參考:https://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.0030161
實際上是幫助捕獲并模擬那些我們觀測或測定不了的會對生物學(xué)結(jié)果造成干擾的變量
基因表達水平可以受很多因素影響,比如遺傳校坑、環(huán)境拣技、人群、技術(shù)等耍目。除了有些已知因素可以測量以外膏斤,很多因素實際上要么是未知的要么是無法測定或者過于復(fù)雜以至于不能在單一模型里很好地捕獲它們。如果不能把這些因素造成的異質(zhì)性考慮進去邪驮,實際上很有可能對研究結(jié)果產(chǎn)生較大的影響莫辨。本文介紹的SVA(‘surrogate variable analysis)是這樣一種方法,它能夠準確地捕獲表達信息和任何建模變量間的關(guān)系耕捞,從而增強生物數(shù)據(jù)的準確性以及分析的可重復(fù)性衔掸。
Plot counts
#怎么檢查某個基因在不同組里的表達情況烫幕?
plotCounts(dds, gene=which.min(res$padj), intgroup="condition")
#還可以用ggplot畫
d <- plotCounts(dds, gene=which.min(res$padj), intgroup="condition",
returnData=TRUE)
library("ggplot2")
ggplot(d, aes(x=condition, y=count)) +
geom_point(position=position_jitter(w=0.1,h=0)) +
scale_y_log10(breaks=c(25,100,400))
關(guān)于Results
#關(guān)于results的詳細信息:變量和檢驗方法的查看
mcols(res)$description
## [1] "mean of normalized counts for all samples"
## [2] "log2 fold change (MLE): condition treated vs untreated"
## [3] "standard error: condition treated vs untreated"
## [4] "Wald statistic: condition treated vs untreated"
## [5] "Wald test p-value: condition treated vs untreated"
## [6] "BH adjusted p-values"
實際上p-value以及一些值可能會輸出NA俺抽,原因有:
- 如果某基因在所有樣本的表達值都是0,將導(dǎo)致pvalue和padj都為NA
- 如果某個基因在某個樣本里有異常的count離群值较曼;異常點的檢測基于Cook’s distance磷斧,不過對異常點的處理是可以自己改的
- 實際上在DESeq2和edgeR中除了library normalization之外,還會進行Independent Filtering。Independent Filtering篩去一些不太可能有生物學(xué)差異的基因弛饭,而不管它的p值是否顯著冕末,以盡量降低假陰性結(jié)果。如果某個基因因為這個(可能是low mean count)被自動過濾了侣颂,那么雖然有pvalue但padj為NA档桃,詳細可參考statQuest課程 http://www.360doc.com/content/18/1007/19/51784026_792756710.shtml
Rich visualization and reporting of results
如果想生成html報告,可以考慮ReportingTools這個包:http://bioconductor.org/packages/release/bioc/html/ReportingTools.html憔晒,示例代碼在對應(yīng)vignette的RNA-seq differential expression的文檔中可以找到
regionReport藻肄、Glimma 、pcaExplorer拒担、iSEE嘹屯、DEvis等也有交互展示運行結(jié)果的功能
Exporting results to CSV files
write.csv(as.data.frame(resOrdered),
file="condition_treated_results.csv")
#還可以通過條件限制需要輸出的結(jié)果:subset函數(shù)
resSig <- subset(resOrdered, padj < 0.1)
resSig
多因子實驗設(shè)計
這里簡單介紹一點,具體的formula設(shè)計在以后的推文會討論从撼。
實際上design的公式可以是非常多樣的州弟,還可以考慮因子間交互作用。pasilla包除了condition以外低零,還有測序類型可以考慮進去:
colData(dds)
# DataFrame with 7 rows and 3 columns
# condition type sizeFactor
# <factor> <factor> <numeric>
# treated1 treated single-read 1.63557509657607
# treated2 treated paired-end 0.761269768042316
# treated3 treated paired-end 0.832652635328833
# untreated1 untreated single-read 1.13826297659084
# untreated2 untreated single-read 1.79300035535039
# untreated3 untreated paired-end 0.649547030603726
# untreated4 untreated paired-end 0.751689223426488
#先拷貝一份dds用來做多因子分析
ddsMF <- dds
levels(ddsMF$type)
## [1] "paired-end" "single-read"
levels(ddsMF$type) <- sub("-.*", "", levels(ddsMF$type))
levels(ddsMF$type)
#注意感興趣的變量要放到末尾
design(ddsMF) <- formula(~ type + condition)
ddsMF <- DESeq(ddsMF)
resMF <- results(ddsMF)
head(resMF)
實際上我們也可以獲得其他變量造成的log2FC婆翔、pvalue等參數(shù)的結(jié)果(這里的其他變量只是測序類型,不是生物學(xué)上有意義的變量)
毁兆;如果是這樣的design:~genotype + condition + genotype:condition浙滤,說明我們對由基因型造成的表達差異感興趣
這里演示只考慮type作為變量的差異分析
resMFType <- results(ddsMF,
contrast=c("type", "single", "paired"))
head(resMFType)
# log2 fold change (MLE): type single vs paired
# Wald test p-value: type single vs paired
# DataFrame with 6 rows and 6 columns
# baseMean log2FoldChange lfcSE
# <numeric> <numeric> <numeric>
# FBgn0000003 0.171568715207063 -1.61158240361812 3.87108289314
# FBgn0000008 95.1440789963134 -0.262254386430198 0.22068580426262
# FBgn0000014 1.05657219346166 3.29058255215038 2.08724091994889
# FBgn0000015 0.846723274987709 -0.581642730889627 2.18165959131568
# FBgn0000017 4352.5928987935 -0.0997652738257474 0.111757030425811
# FBgn0000018 418.614930505122 0.229299212203436 0.1305752643427
# stat pvalue padj
# <numeric> <numeric> <numeric>
# FBgn0000003 -0.416313070038884 0.677180929822828 NA
# FBgn0000008 -1.18836092473855 0.234691244221223 0.543823121250909
# FBgn0000014 1.57652263363587 0.114905406827234 NA
# FBgn0000015 -0.266605630504831 0.789772819994317 NA
# FBgn0000017 -0.892697966701751 0.372018939542609 0.683007220487336
# FBgn0000018 1.7560692935044 0.0790765774697509 0.292463782417282
到這里主要的workflow就結(jié)束了,后面會介紹更高階的操作:數(shù)據(jù)轉(zhuǎn)換气堕、可視化纺腊、個性化分析等