1.構(gòu)建DESeq2對(duì)象
(1)從SummarizedExperiment對(duì)象構(gòu)建DESeqDataSet對(duì)象
dds <- DESeqDataSet( se, design = ~ cell + dex)
- se為RangedSummarizedExperiment對(duì)象鼠渺,行信息rowRanges(se)為基因區(qū)間帮掉,列信息colData(se)為樣本信息(四種細(xì)胞系镶摘,每一個(gè)cell lines都有對(duì)照組與dex處理組)拭宁,實(shí)驗(yàn)數(shù)據(jù)assays(se)為原始的read counts。
- dds為DESeqDataSet對(duì)象吕喘,dds與se的區(qū)別是se的assay slot被DESeq2的counts accessor function代替瓷胧。
- dds包含設(shè)計(jì)公式design formula泪喊,design(dds)。實(shí)驗(yàn)設(shè)計(jì)為cell+dex骏庸,希望檢測(cè)對(duì)于不同細(xì)胞毛甲,地塞米松處理的效果
(2)從表達(dá)矩陣countData和樣品信息colData構(gòu)建DESeqDataSet對(duì)象
dds <- DESeqDataSetFromMatrix( countData = countData, colData = colData, design = ~ Group)
- 表達(dá)矩陣countData,行為基因具被,列為樣本玻募,表達(dá)量必須是非負(fù)整數(shù)。
- 樣本信息colData一姿,每一行對(duì)應(yīng)一個(gè)樣本七咧,行名與countData的樣本順序一一對(duì)應(yīng),列為各種分組信息叮叹。
- 設(shè)計(jì)公式通常格式為
~ batch + conditions
艾栋,batch和conditions都是colData的一列,是因子型變量蛉顽。為了方便后續(xù)計(jì)算蝗砾,最為關(guān)注的分組信息放在最后一位。如果記錄了樣本的批次信息,或者其它需要抹除的信息可以定義在design參數(shù)中悼粮,在下游回歸分析中會(huì)根據(jù)design formula來(lái)估計(jì)batch effect的影響闲勺,并在下游分析時(shí)減去這個(gè)影響。這是處理batch effect的推薦方式扣猫。在模型中考慮batch effect并沒有在數(shù)據(jù)矩陣中移除bacth effect菜循,如果下游處理時(shí)確實(shí)有需要,可以使用limma包的removeBatchEffect來(lái)處理苞笨。 - 默認(rèn)情況下债朵,R會(huì)根據(jù)字母表順序排列因子型變量,排在最前面的因子作為對(duì)照瀑凝。設(shè)置對(duì)照:
colData$Group <- relevel(colData$Group, ref=“WTF”)
或colData$Group <- factor(colData$Group, levels = c(“WTF”, ”WTM”, ”MF”, ”MMF”))
- designs with multiple variables, e.g.,
~ group + condition
, and designs with interactions (answering: is the condition effect different across genotypes?) , e.g.,~ genotype + treatment + genotype:treatment
. 默認(rèn)情況下序芦,此包中的函數(shù)將使用公式中的最后一個(gè)變量來(lái)構(gòu)建結(jié)果表和繪圖。 -
design(dds) <- value
. value, a formula used for estimating dispersion and fitting Negative Binomial GLMs.
2.過(guò)濾低豐度數(shù)據(jù)
dds <- dds[rowSums(counts(dds)) > 1, ]
或者在構(gòu)建dds之前加上
countData <- count[apply(count, 1, sum) > 1 , ]
- 在獨(dú)立篩選(independent filtering)中粤咪,DESeq2可以去掉在所有樣品中平均表達(dá)量CPM不大于min.CPM的基因谚中,以減少假陰性。
- EdgeR是保留在2個(gè)或更多樣品中表達(dá)量大于min.CPM的基因寥枝。
-
可以嘗試不同的cutoff宪塔,以獲得最佳效果。
statquest
3.兩種數(shù)據(jù)轉(zhuǎn)化方法
(1)方差穩(wěn)定變換囊拜,The variance stabilizing transformation
vsd <- vst(object=dds,blind=T)
- 樣本信息的列名names(colData(vsd))多了1列sizeFactor某筐,colData(vsd)$sizeFactor
- 基因信息的列名names(rowData(vsd))多了4列
- vst函數(shù)快速估計(jì)離散趨勢(shì)并應(yīng)用方差穩(wěn)定變換。該函數(shù)從擬合的離散-均值關(guān)系中計(jì)算方差穩(wěn)定變換(VST)冠跷,然后變換count data(除以標(biāo)準(zhǔn)化因子)南誊,得到一個(gè)近似為同方差的值矩陣(沿均值范圍具有恒定的方差)。許多常見的多維數(shù)據(jù)探索性分析方法蜜托,例如聚類或PCA抄囚,對(duì)于同方差的數(shù)據(jù)表現(xiàn)良好。數(shù)據(jù)集小于30個(gè)樣品可以用rlog橄务,數(shù)據(jù)集大于30個(gè)樣品用vst幔托,因?yàn)閞log速度慢。
(2)正則化對(duì)數(shù)變換蜂挪,The regularized-logarithm transformation
rld <- rlog(object=dds,blind=F)
- 樣本信息多了1列sizeFactor重挑,和vsd的sizeFactor相同
- 基因信息多了7列
- rlog函數(shù)將count data轉(zhuǎn)換為log2尺度,以最小化有small counts的行的樣本間差異棠涮,并使library size標(biāo)準(zhǔn)化谬哀。rlog在size factors變化很大的情況下更穩(wěn)健叁扫。
(3)用法
- blind穆碎,轉(zhuǎn)換時(shí)是否忽視實(shí)驗(yàn)設(shè)計(jì)。
blind=T
项秉,不考慮實(shí)驗(yàn)設(shè)計(jì),用于樣品質(zhì)量保證(sample quality assurance劲室,QA)伦仍。blind=F
,考慮實(shí)驗(yàn)設(shè)計(jì)很洋,用于downstream analysis充蓝。
(4)為什么要轉(zhuǎn)換?為了確保所有基因有大致相同的貢獻(xiàn)喉磁。
對(duì)于RNA-seq raw counts谓苟,方差隨均值增長(zhǎng)。如果直接用size-factor-normalized read counts:counts(dds, normalized=T) 進(jìn)行主成分分析协怒,結(jié)果通常只取決于少數(shù)幾個(gè)表達(dá)最高的基因涝焙,因?yàn)樗鼈冿@示了樣本之間最大的絕對(duì)差異。為了避免這種情況孕暇,一個(gè)策略是采用the logarithm of the normalized count values plus a small pseudocount:log2(counts(dds2, normalized=T) +1)
仑撞。但是這樣,有很低counts的基因?qū)A向于主導(dǎo)結(jié)果妖滔。作為一種解決方案隧哮,DESeq2為counts數(shù)據(jù)提供了stabilize the variance across the mean的轉(zhuǎn)換。其中之一是regularized-logarithm transformation or rlog2座舍。對(duì)于counts較高的基因沮翔,rlog轉(zhuǎn)換可以得到與普通log2轉(zhuǎn)換相似的結(jié)果。然而曲秉,對(duì)于counts較低的基因采蚀,所有樣本的值都縮小到基因的平均值。
用于繪制PCA圖或聚類的數(shù)據(jù)可以有多種:counts岸浑、CPM搏存、log2(counts+1)瑰步、log2(CPM+1)矢洲、vst、rlog等缩焦。
4. DESeq2的標(biāo)準(zhǔn)化方法
(1)計(jì)算歸一化系數(shù)sizeFactor
dds <- estimateSizeFactors(dds)
- colData(dds)多了sizeFactor這一列读虏,對(duì)測(cè)序深度和文庫(kù)組成進(jìn)行校正。和vsd袁滥、rld的sizeFactor是一樣的盖桥。
(2)標(biāo)準(zhǔn)化之后的數(shù)據(jù)
normalized_counts <- counts(dds,normalized=T)
- 將原始的表達(dá)量除以每個(gè)樣本的歸一化系數(shù),就得到了歸一化之后的表達(dá)量题翻。read counts/sizeFactor揩徊。
5. 差異表達(dá)分析
(1)一步
dds <- DESeq(dds)
(2)用法
DESeq(object, test = c("Wald", "LRT"), fit Type = c("parametric", "local", "mean"), sfType = c("ratio", "poscounts", "iterate"),betaPrior, full = design(object), reduced, quiet = FALSE, minReplicatesForReplace = 7, modelMatrixType, useT = FALSE, minmu = 0.5, parallel = FALSE, BPPARAM = bpparam())
- test可以是Wald significance tests或likelihood ratio test(似然比檢驗(yàn)),on the difference in deviance between a full and reduced model formula。
(3)分步
- 計(jì)算歸一化系數(shù)sizeFactor
dds <- estimateSizeFactors(dds)
- 估計(jì)基因的離散程度
dds <- estimateDispersions(dds)
DESeq2假定基因的表達(dá)量符合負(fù)二項(xiàng)分布塑荒,有兩個(gè)關(guān)鍵參數(shù)熄赡,總體均值和離散程度α值。這個(gè)α值衡量的是均值和方差之間的關(guān)系齿税。
- 統(tǒng)計(jì)檢驗(yàn)彼硫,差異分析
dds <- nbinomWaldTest(dds)
6. 獲得分析結(jié)果
(1)默認(rèn)情況
res <- results(dds)
- 默認(rèn)使用樣本信息的最后一個(gè)因子與第一個(gè)因子進(jìn)行比較。
- 返回一個(gè)數(shù)據(jù)框res凌箕,包含6列:baseMean拧篮、log2FC、lfcSE牵舱、stat串绩、pvalue、padj
- baseMean表示所有樣本經(jīng)過(guò)歸一化系數(shù)矯正的read counts(counts/sizeFactor)的均值芜壁。
baseMean = apply( normalized_counts, 1, mean )
赏参。 - log2Foldchange表示該基因的表達(dá)發(fā)生了多大的變化,是估計(jì)的效應(yīng)大小effect size沿盅。對(duì)差異表達(dá)的倍數(shù)取以2為底的對(duì)數(shù)把篓,變化倍數(shù)=2^log2Foldchange。
log2FoldChange = apply( normalized_counts, 1, function(t) {log2( mean(t[5:8])/ mean(t[1:4]))}
(并不完全相等腰涧。log2FC反映的是不同分組間表達(dá)量的差異韧掩,這個(gè)差異由兩部分構(gòu)成,一種是樣本間本身的差異窖铡,比如生物學(xué)重復(fù)樣本間基因的表達(dá)量就有一定程度的差異疗锐,另外一部分就是我們真正感興趣的,由于分組不同或者實(shí)驗(yàn)條件不同造成的差異费彼。用歸一化之后的數(shù)值直接計(jì)算出的log2FC包含了以上兩種差異滑臊,而我們真正感興趣的只有分組不同造成的差異,DESeq2在差異分析的過(guò)程中已經(jīng)考慮到了樣本本身的差異箍铲,其最終提供的log2FC只包含了分組間的差異雇卷,所以會(huì)與手動(dòng)計(jì)算的不同)。 - lfcSE(logfoldchange Standard Error)是對(duì)于log2Foldchange估計(jì)的標(biāo)準(zhǔn)誤差估計(jì)颠猴,效應(yīng)大小估計(jì)有不確定性关划。
- stat是Wald統(tǒng)計(jì)量,它是由log2Foldchange除以標(biāo)準(zhǔn)差所得翘瓮。
- pvalue和padj分別代表原始的p值以及經(jīng)過(guò)校正后的p值贮折。adjusted p value less than 0.1 should contain no more than 10% false positives.
- Need to filter on adjusted p-values, not p-values, to obtain FDR control. 10% FDR is common because RNA-seq experiments are often exploratory and having 90% true positives in the gene set is ok.
(2)比較任何兩組數(shù)據(jù)
resultsNames(dds)
res <- results(dds, name="Group_MMF_vs_WTF")
res <- results(dds, contrast=c("Group"," MMF "," WTF ")) #后面的是對(duì)照
(3)用法
results(object, contrast, name, lfcThreshold = 0, altHypothesis = c("greaterAbs", "less Abs", "greater", "less"), listValues = c(1, -1), cooksCutoff, independentFiltering = TRUE, alpha = 0.1, filter, theta, p Adjust Method = "BH", filterFun, format = c("DataFrame", "GRanges", "GRangesList"), test,
addMLE = FALSE, tidy = FALSE, parallel = FALSE, BPPARAM = bpparam(), minmu = 0.5)
- contrast
this argument specifies what comparison to extract from the object to build a results table. one of either:(此參數(shù)指定從對(duì)象中提取什么比較以構(gòu)建結(jié)果表)
a) a character vector with exactly three elements: the name of a factor in the design formula, the name of the numerator level for the fold change, and the name of the denominator level for the fold change (simplest case) (有三個(gè)元素的字符向量:設(shè)計(jì)公式中一個(gè)因子的名稱、fold change的分子level的名稱资盅、fold change的分母level的名稱)
b) a list of 2 character vectors: the names of the fold changes for the numerator, and the names of the fold changes for the denominator. these names should be elements of resultsNames(object). if the list is length 1, a second element is added which is the empty character vector, character().
(more general case, can be to combine interaction terms and main effects)(一個(gè)由兩個(gè)字符向量組成的列表:分子的fold change名稱调榄,分母的fold change名稱踊赠。這些名稱應(yīng)該是resultsNames(object)。如果列表是長(zhǎng)度1每庆,則添加第二個(gè)元素臼疫,即空字符向量character()。更一般的情況是扣孟,可以將交互項(xiàng)和主要效果結(jié)合起來(lái)烫堤。)
c) a numeric contrast vector with one element for each element in resultsNames(object) (most general case)(具有一個(gè)元素的數(shù)值型對(duì)比向量,對(duì)于resultsNames(object)中的每一個(gè)元素)
If specified, the name argument is ignored. - name
the name of the individual effect (coefficient) for building a results table. Use this argument rather than contrast for continuous variables, individual effects or for individual interaction terms. The value provided to name must be an element of resultsNames(object).(建立結(jié)果表的單個(gè)效應(yīng)(系數(shù))的名稱凤价。對(duì)于連續(xù)變量鸽斟、單個(gè)效應(yīng)或單個(gè)交互項(xiàng),使用此參數(shù)而不是contrast利诺。 提供給name的值必須是resultsNames(object)的元素) - lfcThreshold
a non-negative value which specifies a log2 fold change threshold. The default value is 0, corresponding to a test that the log2 fold changes are equal to zero. - independentFiltering
logical, whether independent filtering should be applied automatically - alpha
the significance cutoff used for optimizing the independent filtering (by default 0.1). If the adjusted p-value cutoff (FDR) will be a value other than 0.1, alpha should be set to that value.(用于優(yōu)化獨(dú)立篩選的顯著性截止值(默認(rèn)情況下為0.1)富蓄。如果adjusted p-value cutoff (FDR)是0.1以外的值,則α應(yīng)設(shè)置為該值慢逾。) - two conditions, three genotypes, with an interaction term. (2種條件立倍,3種基因型,和相互作用項(xiàng)) The interaction term, answering: is the condition effect different across genotypes?(相互作用項(xiàng)解釋了條件的效果在基因型之間是否不同)
dds <- makeExampleDESeqDataSet(n=100,m=18)
dds$genotype <- factor(rep(rep(c("I","II","III"),each=3),2))
design(dds) <- ~ genotype + condition + genotype:condition
dds$condition
[1] A A A A A A A A A B B B B B B B B B
Levels: A B
dds$genotype
[1] I I I II II II III III III I I I II II II III III III
Levels: I II III
dds <- DESeq(dds)
resultsNames(dds)
[1] "Intercept" "genotype_II_vs_I"
[3] "genotype_III_vs_I" "condition_B_vs_A"
[5] "genotypeII.conditionB" "genotypeIII.conditionB"
# the condition effect for genotype I (the main effect)(基因型I的條件效應(yīng)侣滩,主要效應(yīng))
results(dds, contrast=c("condition","B","A"))
log2 fold change (MLE): condition B vs A
Wald test p-value: condition B vs A
# the condition effect for genotype III.(基因型III的條件效應(yīng))
# this is the main effect *plus* the interaction term (主要效應(yīng)加上相互作用項(xiàng))
# (the extra condition effect in genotype III compared to genotype I).(基因型III與基因型I相比額外的條件作用效果)
results(dds, contrast=list( c("condition_B_vs_A","genotypeIII.conditionB") ))
log2 fold change (MLE): condition_B_vs_A+genotypeIII.conditionB effect
Wald test p-value: condition_B_vs_A+genotypeIII.conditionB effect
# the interaction term for condition effect in genotype III vs genotype I.(基因型III與基因型I相比額外的條件作用效果)
# this tests if the condition effect is different in III compared to I(檢測(cè)了條件效果在基因型III與基因型I之間是否有區(qū)別)
results(dds, name="genotypeIII.conditionB")
log2 fold change (MLE): genotypeIII.conditionB
Wald test p-value: genotypeIII.conditionB
# the interaction term for condition effect in genotype III vs genotype II. (基因型III與基因型II相比額外的條件作用效果)
# this tests if the condition effect is different in III compared to II(檢測(cè)了條件效果在基因型III與基因型II之間是否有區(qū)別)
results(dds, contrast=list("genotypeIII.conditionB", "genotypeII.conditionB"))
log2 fold change (MLE): genotypeIII.conditionB vs genotypeII.conditionB
Wald test p-value: genotypeIII.conditionB vs genotypeII.conditionB
# Note that a likelihood ratio could be used to test if there are any
# differences in the condition effect between the three genotypes.
- Using a grouping variable
# This is a useful construction when users just want to compare
# specific groups which are combinations of variables.
dds$group <- factor(paste0(dds$genotype, dds$condition))
design(dds) <- ~ group
dds <- DESeq(dds)
resultsNames(dds)
# the condition effect for genotypeIII
results(dds, contrast=c("group", "IIIB", "IIIA"))
(4)結(jié)果的簡(jiǎn)單統(tǒng)計(jì)
summary(res)
summary(res, alpha=0.1)
- alpha: the adjusted p-value cutoff. If not set, this defaults to the alpha argument which was used in results to set the target FDR for independent filtering, or if independent filtering was not performed, to 0.1.
(5)排序和篩選
resOrdered <- res[order(res$pvalue), ] #從小到大排序口注,默認(rèn)decreasing = F
sum(res$padj < 0.1, na.rm=TRUE) #有多少padj小于0.1的
diff_gene <-subset(res, padj < 0.1 & abs(log2FoldChange) > 1)
(6)2種更嚴(yán)格的方法篩選顯著差異基因
- 降低false discovery rate threshold (the threshold on padj in the results table)
- 提升log2 fold change threshold (from 0 using the lfc Threshold argument of results)
res.05 <- results(dds, alpha=0.05)
#alpha為padj的閾值,默認(rèn)padj=0.1君珠。
resLFC <- results(dds, lfcThreshold=1)
#提升log2 fold change threshold寝志,結(jié)果中不滿足lfc閾值的gene的p值都是1。
7. LFC校正lfcShrink
(1)介紹
Adds shrunken log2 fold changes (LFC) and SE to a results table from DESeq run without LFC shrinkage. For consistency with results, the column name lfcSE is used here although what is returned is a posterior SD. Three shrinkage estimators for LFC are available via type (see the
vignette for more details on the estimators). The apeglm publication demonstrates that ’apeglm’ and ’ashr’ outperform the original ’normal’ shrinkage estimator.
The shrunken fold changes are useful for ranking genes by effect size and for visualization.
縮小的倍數(shù)變化有助于按效應(yīng)大小對(duì)基因進(jìn)行排序和可視化策添。
log2FC estimates do not account for the large dispersion we observe with low read counts.
log2 FC估計(jì)不能解釋我們?cè)诘蛂ead counts下觀察到的大的離散程度材部。
As with the shrinkage of dispersion estimates, LFC shrinkage uses information from all genes to generate more accurate estimates.
與估計(jì)離散程度的收縮一樣,LFC收縮使用來(lái)自所有基因的信息來(lái)生成更準(zhǔn)確的估計(jì)唯竹。
如果要根據(jù)LFC值提取差異基因乐导,需要shrunken values。另外浸颓,進(jìn)行功能分析例如GSEA時(shí)物臂,需要提供shrunken values。
(2)應(yīng)用
resultsNames(dds)
[1] "Intercept" "condition_treated_vs_untreated"
resLFC <- lfcShrink(dds, coef="condition_treated_vs_untreated", type="apeglm")
#或
resLFC <- lfcShrink(dds, contrast = c("condition"," treated "," untreated "))
- 選擇的apeglm參數(shù)進(jìn)行effect size shrinkage猾愿,改善了先前的估計(jì)鹦聪。resLFC相比res數(shù)據(jù)更加緊湊账阻。type可以選擇apeglm蒂秘、ashr等。
names(resLFC)
[1] "baseMean" "log2FoldChange" "lfcSE" "pvalue" "padj"
- 與res相比少了"stat"一列淘太。這一步只會(huì)對(duì)LFC的值產(chǎn)生影響姻僧,p值是沒有改變的规丽,不會(huì)改變顯著差異的基因總數(shù)。
(3)用法
lfcShrink(dds, coef, contrast, res, type = c("normal", "apeglm", "ashr"), lfcThreshold = 0, svalue = FALSE, return List = FALSE, format = c("DataFrame", "GRanges", "GRangesList"), apeAdapt = TRUE, apeMethod = "nbinomCR", parallel = FALSE, BPPARAM = bpparam(), quiet = FALSE, ...)
- coef
the name or number of the coefficient (LFC) to shrink, consult resultsNames(dds) after running DESeq(dds). note: only coef or contrast can be specified, not both. apeglm requires use of coef. For normal, one of coef or contrast must be provided.(要收縮的系數(shù)的名稱或編號(hào)撇贺,通過(guò)resultsNames(dds)查看赌莺。coef或contrast二選一。apeglm需要coef松嘶。) - contrast
see argument description in results. only coef or contrast can be specified, not both. - type
"normal" is the original DESeq2 shrinkage estimator; "apeglm" is the adaptive t prior shrinkage estimator from the ’apeglm’ package; "ashr" is the adaptive shrinkage estimator from the ’ashr’ package, using a fitted mixture of normal prior.
8. 似然比檢驗(yàn)LRT
ddsLRT <- DESeq(dds, test="LRT", reduced=~1)
resLRT <- results(ddsLRT)
9. 開啟多線程
library("BiocParallel")
register(MulticoreParam(4))
#先預(yù)定4個(gè)核艘狭,等需要的時(shí)候直接使用parallel=TRUE來(lái)調(diào)用。
10. Plotting results
(1)樣本間關(guān)系熱圖(總體相似度)
library(pheatmap)
library(RColorBrewer)
rld
sampleDist <- dist(t(assay(rld))) #樣品距離,歐氏距離,t轉(zhuǎn)置
#為確保所有基因大致相同的contribution用rlog-transformed data
#畫某些基因在樣本間的heatmap也可以用rlog數(shù)據(jù)
#用PoiClaClu包計(jì)算泊松距離(Poisson Distance)翠订,必須是原始表達(dá)矩陣
#poisd <- PoissonDistance(t(counts(dds)))
sampleDistMatrix <- as.matrix(sampleDist) #樣品間距離的矩陣
rownames(sampleDistMatrix) <- paste0(rld $cell,"-", rld$dex)
colnames(sampleDistMatrix) <- NULL
head(sampleDistMatrix) #樣品間距離的數(shù)據(jù)框
colors <- colorRampPalette(rev(brewer.pal(9,"Blues")))(255)
pheatmap(sampleDistMatrix,
clustering_distance_rows=sampleDist,
clustering_distance_cols=sampleDist,
color = colors)
(2)多維尺度分析(multidimensional scaling巢音,MDS)或主坐標(biāo)分析(principal coordinates analysis,PCoA)
library(ggplot2)
#把樣本之間的距離轉(zhuǎn)化成二維坐標(biāo),在降維過(guò)程中保證樣品點(diǎn)之間的距離不變
#MDS基于最小線性距離(歐氏距離)的聚類,與PCA的最大線性相關(guān)是一樣的
#適合在沒有表達(dá)矩陣值,但只有一個(gè)距離矩陣的情況下使用
mdsdata <- data.frame(cmdscale(sampleDistMatrix))
#cmdscale(classical multidimensional scaling)
mdsdata #返回MDS的坐標(biāo)
mds <- cbind(mdsdata,as.data.frame(colData(vsd)))
mds #按列合并
ggplot(data=mds,aes(X1,X2,color=cell,shape=dex)) +
geom_point(size=3)
(3)主成分分析(Principal Component Analysis尽超,PCA)
pcadata <- plotPCA(vsd,intgroup = c("Batch","Group"), returnData=TRUE)
percentVar <- round(100*attr(pcadata,"percentVar"),1)
ggplot(pcadata, aes(PC1, PC2, color=Group, shape=Batch)) +
geom_point(size=3) +
xlab(paste0("PC1: ",percentVar[1],"% variance")) +
ylab(paste0("PC2: ",percentVar[2],"% variance")) +
geom_text_repel(aes(PC1, PC2,color=Group,label=colnames(vsd)),size=3) +
theme_bw()
(4)plotCounts()函數(shù)查看特定基因的表達(dá)量
topGene <- rownames(res)[which.min(res$padj)] #padj最小的一個(gè)基因
plotCounts(dds, gene=topGene, intgroup=c("dex")) #畫出這個(gè)基因的標(biāo)準(zhǔn)化后的表達(dá)量
#以散點(diǎn)圖的形式畫出這個(gè)基因在各樣本中的表達(dá)量
data <- plotCounts(dds, gene=topGene, intgroup=c("dex","cell"), returnData=TRUE)
ggplot(data, aes(x=dex, y=count, color=cell)) +
scale_y_log10() +
geom_point(position=position_jitter(width=.1,height=0), size=3)
ggplot(data, aes(x=dex, y=count, fill=dex)) +
scale_y_log10() +
geom_dotplot(binaxis="y", stackdir="center")
ggplot(data, aes(x=dex, y=count, color=cell, group=cell)) +
scale_y_log10() + geom_point(size=3) + geom_line()
(5)MA圖
plotMA(res, ylim=c(-5,5))
#"M" for minus(減), because a log ratio is equal to log minus log, and "A" for average(均值)
#M對(duì)應(yīng)差異對(duì)比組之間基因表達(dá)變化log2 fold changes (Y軸)
#A對(duì)應(yīng)差異對(duì)比組基因表達(dá)量均值the mean of normalized counts (X軸)
plotMA(res, alpha = 0.1, main = "", xlab = "mean of normalized counts", ylim=c(-5,5))
#alpha為padj顯著性水平閾值官撼,默認(rèn)alpha=0.1
#Each gene is represented with a dot. Genes with an adjusted p value below a threshold (here 0.1, the default) are shown in red.
plotMA(resLFC, ylim=c(-5,5)) #提高了log2 fold change閾值
topGene <- rownames(resLFC)[which.min(resLFC$padj)]
with(resLFC[topGene, ], {
points(baseMean, log2FoldChange, col="dodgerblue", cex=2, lwd=2)
text(baseMean, log2FoldChange, topGene, pos=2, col="dodgerblue")
}) #標(biāo)記出一個(gè)特定的基因
(6)Removing hidden batch effects
library("sva")
dat <- counts(dds, normalized=TRUE)
idx <- rowMeans(dat) > 1
dat <- dat[idx,]
mod <- model.matrix(~ dex, colData(dds))
mod0 <- model.matrix(~ 1, colData(dds))
svseq <- svaseq(dat, mod, mod0, n.sv=2)
svseq$sv
par(mfrow=c(2,1),mar=c(3,5,3,1))
stripchart(svseq$sv[,1] ~ dds$cell,vertical=TRUE,main="SV1")
abline(h=0)
stripchart(svseq$sv[,2] ~ dds$cell,vertical=TRUE,main="SV2")
abline(h=0)
ddssva <- dds
ddssva$SV1 <- svseq$sv[,1]
ddssva$SV2 <- svseq$sv[,2]
design(ddssva) <- ~ SV1 + SV2 + dex
ddssva <- DESeq(ddssva)