DESeq2詳細(xì)用法

1.構(gòu)建DESeq2對(duì)象

(1)從SummarizedExperiment對(duì)象構(gòu)建DESeqDataSet對(duì)象

dds <- DESeqDataSet( se, design = ~ cell + dex) 
  • se為RangedSummarizedExperiment對(duì)象鼠渺,行信息rowRanges(se)為基因區(qū)間帮掉,列信息colData(se)為樣本信息(四種細(xì)胞系镶摘,每一個(gè)cell lines都有對(duì)照組與dex處理組)拭宁,實(shí)驗(yàn)數(shù)據(jù)assays(se)為原始的read counts
  • dds為DESeqDataSet對(duì)象吕喘,dds與se的區(qū)別是se的assay slot被DESeq2的counts accessor function代替瓷胧。
  • dds包含設(shè)計(jì)公式design formula泪喊,design(dds)。實(shí)驗(yàn)設(shè)計(jì)為cell+dex骏庸,希望檢測(cè)對(duì)于不同細(xì)胞毛甲,地塞米松處理的效果

(2)從表達(dá)矩陣countData和樣品信息colData構(gòu)建DESeqDataSet對(duì)象

dds <- DESeqDataSetFromMatrix( countData = countData, colData = colData, design = ~ Group) 
構(gòu)建DESeqDataSet對(duì)象
  • 表達(dá)矩陣countData,行為基因具被,列為樣本玻募,表達(dá)量必須是非負(fù)整數(shù)。
  • 樣本信息colData一姿,每一行對(duì)應(yīng)一個(gè)樣本七咧,行名與countData的樣本順序一一對(duì)應(yīng),列為各種分組信息叮叹。
  • 設(shè)計(jì)公式通常格式為~ batch + conditions艾栋,batch和conditions都是colData的一列,是因子型變量蛉顽。為了方便后續(xù)計(jì)算蝗砾,最為關(guān)注的分組信息放在最后一位。如果記錄了樣本的批次信息,或者其它需要抹除的信息可以定義在design參數(shù)中悼粮,在下游回歸分析中會(huì)根據(jù)design formula來(lái)估計(jì)batch effect的影響闲勺,并在下游分析時(shí)減去這個(gè)影響。這是處理batch effect的推薦方式扣猫。在模型中考慮batch effect并沒有在數(shù)據(jù)矩陣中移除bacth effect菜循,如果下游處理時(shí)確實(shí)有需要,可以使用limma包的removeBatchEffect來(lái)處理苞笨。
  • 默認(rèn)情況下债朵,R會(huì)根據(jù)字母表順序排列因子型變量,排在最前面的因子作為對(duì)照瀑凝。設(shè)置對(duì)照:colData$Group <- relevel(colData$Group, ref=“WTF”)colData$Group <- factor(colData$Group, levels = c(“WTF”, ”WTM”, ”MF”, ”MMF”))
  • designs with multiple variables, e.g., ~ group + condition, and designs with interactions (answering: is the condition effect different across genotypes?) , e.g., ~ genotype + treatment + genotype:treatment. 默認(rèn)情況下序芦,此包中的函數(shù)將使用公式中的最后一個(gè)變量來(lái)構(gòu)建結(jié)果表和繪圖。
  • design(dds) <- value. value, a formula used for estimating dispersion and fitting Negative Binomial GLMs.

2.過(guò)濾低豐度數(shù)據(jù)

dds <- dds[rowSums(counts(dds)) > 1, ] 

或者在構(gòu)建dds之前加上

 countData <- count[apply(count, 1, sum) > 1 , ] 
  • 在獨(dú)立篩選(independent filtering)中粤咪,DESeq2可以去掉在所有樣品中平均表達(dá)量CPM不大于min.CPM的基因谚中,以減少假陰性。
  • EdgeR是保留在2個(gè)或更多樣品中表達(dá)量大于min.CPM的基因寥枝。
  • 可以嘗試不同的cutoff宪塔,以獲得最佳效果。


    statquest

3.兩種數(shù)據(jù)轉(zhuǎn)化方法

(1)方差穩(wěn)定變換囊拜,The variance stabilizing transformation

vsd <- vst(object=dds,blind=T) 
  • 樣本信息的列名names(colData(vsd))多了1列sizeFactor某筐,colData(vsd)$sizeFactor
  • 基因信息的列名names(rowData(vsd))多了4列
  • vst函數(shù)快速估計(jì)離散趨勢(shì)并應(yīng)用方差穩(wěn)定變換。該函數(shù)從擬合的離散-均值關(guān)系中計(jì)算方差穩(wěn)定變換(VST)冠跷,然后變換count data(除以標(biāo)準(zhǔn)化因子)南誊,得到一個(gè)近似為同方差的值矩陣(沿均值范圍具有恒定的方差)。許多常見的多維數(shù)據(jù)探索性分析方法蜜托,例如聚類或PCA抄囚,對(duì)于同方差的數(shù)據(jù)表現(xiàn)良好。數(shù)據(jù)集小于30個(gè)樣品可以用rlog橄务,數(shù)據(jù)集大于30個(gè)樣品用vst幔托,因?yàn)閞log速度慢。

(2)正則化對(duì)數(shù)變換蜂挪,The regularized-logarithm transformation

rld <- rlog(object=dds,blind=F) 
  • 樣本信息多了1列sizeFactor重挑,和vsd的sizeFactor相同
  • 基因信息多了7列
  • rlog函數(shù)將count data轉(zhuǎn)換為log2尺度,以最小化有small counts的行的樣本間差異棠涮,并使library size標(biāo)準(zhǔn)化谬哀。rlog在size factors變化很大的情況下更穩(wěn)健叁扫。

(3)用法

  • blind穆碎,轉(zhuǎn)換時(shí)是否忽視實(shí)驗(yàn)設(shè)計(jì)。blind=T项秉,不考慮實(shí)驗(yàn)設(shè)計(jì),用于樣品質(zhì)量保證(sample quality assurance劲室,QA)伦仍。blind=F,考慮實(shí)驗(yàn)設(shè)計(jì)很洋,用于downstream analysis充蓝。

(4)為什么要轉(zhuǎn)換?為了確保所有基因有大致相同的貢獻(xiàn)喉磁。

對(duì)于RNA-seq raw counts谓苟,方差隨均值增長(zhǎng)。如果直接用size-factor-normalized read counts:counts(dds, normalized=T) 進(jìn)行主成分分析协怒,結(jié)果通常只取決于少數(shù)幾個(gè)表達(dá)最高的基因涝焙,因?yàn)樗鼈冿@示了樣本之間最大的絕對(duì)差異。為了避免這種情況孕暇,一個(gè)策略是采用the logarithm of the normalized count values plus a small pseudocount:log2(counts(dds2, normalized=T) +1)仑撞。但是這樣,有很低counts的基因?qū)A向于主導(dǎo)結(jié)果妖滔。作為一種解決方案隧哮,DESeq2為counts數(shù)據(jù)提供了stabilize the variance across the mean的轉(zhuǎn)換。其中之一是regularized-logarithm transformation or rlog2座舍。對(duì)于counts較高的基因沮翔,rlog轉(zhuǎn)換可以得到與普通log2轉(zhuǎn)換相似的結(jié)果。然而曲秉,對(duì)于counts較低的基因采蚀,所有樣本的值都縮小到基因的平均值。
用于繪制PCA圖或聚類的數(shù)據(jù)可以有多種:counts岸浑、CPM搏存、log2(counts+1)瑰步、log2(CPM+1)矢洲、vst、rlog等缩焦。

4. DESeq2的標(biāo)準(zhǔn)化方法

(1)計(jì)算歸一化系數(shù)sizeFactor

dds <- estimateSizeFactors(dds) 
  • colData(dds)多了sizeFactor這一列读虏,對(duì)測(cè)序深度和文庫(kù)組成進(jìn)行校正。和vsd袁滥、rld的sizeFactor是一樣的盖桥。

(2)標(biāo)準(zhǔn)化之后的數(shù)據(jù)

normalized_counts <- counts(dds,normalized=T) 
  • 將原始的表達(dá)量除以每個(gè)樣本的歸一化系數(shù),就得到了歸一化之后的表達(dá)量题翻。read counts/sizeFactor揩徊。

5. 差異表達(dá)分析

(1)一步

dds <- DESeq(dds) 
DESeq2

(2)用法

DESeq(object, test = c("Wald", "LRT"), fit Type = c("parametric", "local", "mean"), sfType = c("ratio", "poscounts", "iterate"),betaPrior, full = design(object), reduced, quiet = FALSE, minReplicatesForReplace = 7, modelMatrixType, useT = FALSE, minmu = 0.5, parallel = FALSE, BPPARAM = bpparam())

  • test可以是Wald significance tests或likelihood ratio test(似然比檢驗(yàn)),on the difference in deviance between a full and reduced model formula。

(3)分步

  • 計(jì)算歸一化系數(shù)sizeFactor
dds <- estimateSizeFactors(dds) 
  • 估計(jì)基因的離散程度
dds <- estimateDispersions(dds) 

DESeq2假定基因的表達(dá)量符合負(fù)二項(xiàng)分布塑荒,有兩個(gè)關(guān)鍵參數(shù)熄赡,總體均值和離散程度α值。這個(gè)α值衡量的是均值和方差之間的關(guān)系齿税。


負(fù)二項(xiàng)分布
  • 統(tǒng)計(jì)檢驗(yàn)彼硫,差異分析
dds <- nbinomWaldTest(dds) 

6. 獲得分析結(jié)果

(1)默認(rèn)情況

res <- results(dds) 
  • 默認(rèn)使用樣本信息的最后一個(gè)因子與第一個(gè)因子進(jìn)行比較。
  • 返回一個(gè)數(shù)據(jù)框res凌箕,包含6列:baseMean拧篮、log2FC、lfcSE牵舱、stat串绩、pvalue、padj
  • baseMean表示所有樣本經(jīng)過(guò)歸一化系數(shù)矯正的read counts(counts/sizeFactor)的均值芜壁。baseMean = apply( normalized_counts, 1, mean )赏参。
  • log2Foldchange表示該基因的表達(dá)發(fā)生了多大的變化,是估計(jì)的效應(yīng)大小effect size沿盅。對(duì)差異表達(dá)的倍數(shù)取以2為底的對(duì)數(shù)把篓,變化倍數(shù)=2^log2Foldchange。log2FoldChange = apply( normalized_counts, 1, function(t) {log2( mean(t[5:8])/ mean(t[1:4]))}(并不完全相等腰涧。log2FC反映的是不同分組間表達(dá)量的差異韧掩,這個(gè)差異由兩部分構(gòu)成,一種是樣本間本身的差異窖铡,比如生物學(xué)重復(fù)樣本間基因的表達(dá)量就有一定程度的差異疗锐,另外一部分就是我們真正感興趣的,由于分組不同或者實(shí)驗(yàn)條件不同造成的差異费彼。用歸一化之后的數(shù)值直接計(jì)算出的log2FC包含了以上兩種差異滑臊,而我們真正感興趣的只有分組不同造成的差異,DESeq2在差異分析的過(guò)程中已經(jīng)考慮到了樣本本身的差異箍铲,其最終提供的log2FC只包含了分組間的差異雇卷,所以會(huì)與手動(dòng)計(jì)算的不同)。
  • lfcSE(logfoldchange Standard Error)是對(duì)于log2Foldchange估計(jì)的標(biāo)準(zhǔn)誤差估計(jì)颠猴,效應(yīng)大小估計(jì)有不確定性关划。
  • stat是Wald統(tǒng)計(jì)量,它是由log2Foldchange除以標(biāo)準(zhǔn)差所得翘瓮。
  • pvalue和padj分別代表原始的p值以及經(jīng)過(guò)校正后的p值贮折。adjusted p value less than 0.1 should contain no more than 10% false positives.
  • Need to filter on adjusted p-values, not p-values, to obtain FDR control. 10% FDR is common because RNA-seq experiments are often exploratory and having 90% true positives in the gene set is ok.

(2)比較任何兩組數(shù)據(jù)

resultsNames(dds) 
resultsNames
res <- results(dds, name="Group_MMF_vs_WTF") 
res <- results(dds, contrast=c("Group"," MMF "," WTF ")) #后面的是對(duì)照 

(3)用法

results(object, contrast, name, lfcThreshold = 0, altHypothesis = c("greaterAbs", "less Abs", "greater", "less"), listValues = c(1, -1), cooksCutoff, independentFiltering = TRUE, alpha = 0.1, filter, theta, p Adjust Method = "BH", filterFun, format = c("DataFrame", "GRanges", "GRangesList"), test,
addMLE = FALSE, tidy = FALSE, parallel = FALSE, BPPARAM = bpparam(), minmu = 0.5)

  • contrast
    this argument specifies what comparison to extract from the object to build a results table. one of either:(此參數(shù)指定從對(duì)象中提取什么比較以構(gòu)建結(jié)果表)
    a) a character vector with exactly three elements: the name of a factor in the design formula, the name of the numerator level for the fold change, and the name of the denominator level for the fold change (simplest case) (有三個(gè)元素的字符向量:設(shè)計(jì)公式中一個(gè)因子的名稱、fold change的分子level的名稱资盅、fold change的分母level的名稱)
    b) a list of 2 character vectors: the names of the fold changes for the numerator, and the names of the fold changes for the denominator. these names should be elements of resultsNames(object). if the list is length 1, a second element is added which is the empty character vector, character().
    (more general case, can be to combine interaction terms and main effects)(一個(gè)由兩個(gè)字符向量組成的列表:分子的fold change名稱调榄,分母的fold change名稱踊赠。這些名稱應(yīng)該是resultsNames(object)。如果列表是長(zhǎng)度1每庆,則添加第二個(gè)元素臼疫,即空字符向量character()。更一般的情況是扣孟,可以將交互項(xiàng)和主要效果結(jié)合起來(lái)烫堤。)
    c) a numeric contrast vector with one element for each element in resultsNames(object) (most general case)(具有一個(gè)元素的數(shù)值型對(duì)比向量,對(duì)于resultsNames(object)中的每一個(gè)元素)
    If specified, the name argument is ignored.
  • name
    the name of the individual effect (coefficient) for building a results table. Use this argument rather than contrast for continuous variables, individual effects or for individual interaction terms. The value provided to name must be an element of resultsNames(object).(建立結(jié)果表的單個(gè)效應(yīng)(系數(shù))的名稱凤价。對(duì)于連續(xù)變量鸽斟、單個(gè)效應(yīng)或單個(gè)交互項(xiàng),使用此參數(shù)而不是contrast利诺。 提供給name的值必須是resultsNames(object)的元素)
  • lfcThreshold
    a non-negative value which specifies a log2 fold change threshold. The default value is 0, corresponding to a test that the log2 fold changes are equal to zero.
  • independentFiltering
    logical, whether independent filtering should be applied automatically
  • alpha
    the significance cutoff used for optimizing the independent filtering (by default 0.1). If the adjusted p-value cutoff (FDR) will be a value other than 0.1, alpha should be set to that value.(用于優(yōu)化獨(dú)立篩選的顯著性截止值(默認(rèn)情況下為0.1)富蓄。如果adjusted p-value cutoff (FDR)是0.1以外的值,則α應(yīng)設(shè)置為該值慢逾。)
  • two conditions, three genotypes, with an interaction term. (2種條件立倍,3種基因型,和相互作用項(xiàng)) The interaction term, answering: is the condition effect different across genotypes?(相互作用項(xiàng)解釋了條件的效果在基因型之間是否不同)
dds <- makeExampleDESeqDataSet(n=100,m=18)
dds$genotype <- factor(rep(rep(c("I","II","III"),each=3),2))
design(dds) <- ~ genotype + condition + genotype:condition
dds$condition
[1] A A A A A A A A A B B B B B B B B B
Levels: A B
dds$genotype
[1] I   I   I   II  II  II  III   III   III   I   I   I   II  II  II  III  III  III
Levels: I   II   III
dds <- DESeq(dds)
resultsNames(dds)
[1] "Intercept"              "genotype_II_vs_I" 
[3] "genotype_III_vs_I"      "condition_B_vs_A" 
[5] "genotypeII.conditionB"  "genotypeIII.conditionB"
# the condition effect for genotype I (the main effect)(基因型I的條件效應(yīng)侣滩,主要效應(yīng))
results(dds, contrast=c("condition","B","A"))
log2 fold change (MLE): condition B vs A 
Wald test p-value: condition B vs A
# the condition effect for genotype III.(基因型III的條件效應(yīng))
# this is the main effect *plus* the interaction term (主要效應(yīng)加上相互作用項(xiàng))
# (the extra condition effect in genotype III compared to genotype I).(基因型III與基因型I相比額外的條件作用效果)
results(dds, contrast=list( c("condition_B_vs_A","genotypeIII.conditionB") ))
log2 fold change (MLE): condition_B_vs_A+genotypeIII.conditionB effect 
Wald test p-value: condition_B_vs_A+genotypeIII.conditionB effect
# the interaction term for condition effect in genotype III vs genotype I.(基因型III與基因型I相比額外的條件作用效果)
# this tests if the condition effect is different in III compared to I(檢測(cè)了條件效果在基因型III與基因型I之間是否有區(qū)別)
results(dds, name="genotypeIII.conditionB")
log2 fold change (MLE): genotypeIII.conditionB 
Wald test p-value: genotypeIII.conditionB 
# the interaction term for condition effect in genotype III vs genotype II. (基因型III與基因型II相比額外的條件作用效果)
# this tests if the condition effect is different in III compared to II(檢測(cè)了條件效果在基因型III與基因型II之間是否有區(qū)別)
results(dds, contrast=list("genotypeIII.conditionB", "genotypeII.conditionB"))
log2 fold change (MLE): genotypeIII.conditionB vs genotypeII.conditionB 
Wald test p-value: genotypeIII.conditionB vs genotypeII.conditionB
# Note that a likelihood ratio could be used to test if there are any
# differences in the condition effect between the three genotypes.
  • Using a grouping variable
# This is a useful construction when users just want to compare
# specific groups which are combinations of variables.
dds$group <- factor(paste0(dds$genotype, dds$condition))
design(dds) <- ~ group
dds <- DESeq(dds)
resultsNames(dds)
# the condition effect for genotypeIII
results(dds, contrast=c("group", "IIIB", "IIIA"))

(4)結(jié)果的簡(jiǎn)單統(tǒng)計(jì)

summary(res) 
summary
summary(res, alpha=0.1) 
  • alpha: the adjusted p-value cutoff. If not set, this defaults to the alpha argument which was used in results to set the target FDR for independent filtering, or if independent filtering was not performed, to 0.1.

(5)排序和篩選

resOrdered <- res[order(res$pvalue), ]  #從小到大排序口注,默認(rèn)decreasing = F 
sum(res$padj < 0.1, na.rm=TRUE)  #有多少padj小于0.1的 
diff_gene <-subset(res, padj < 0.1 & abs(log2FoldChange) > 1) 

(6)2種更嚴(yán)格的方法篩選顯著差異基因

  • 降低false discovery rate threshold (the threshold on padj in the results table)
  • 提升log2 fold change threshold (from 0 using the lfc Threshold argument of results)
res.05 <- results(dds, alpha=0.05) 
#alpha為padj的閾值,默認(rèn)padj=0.1君珠。
resLFC <- results(dds, lfcThreshold=1) 
#提升log2 fold change threshold寝志,結(jié)果中不滿足lfc閾值的gene的p值都是1。

7. LFC校正lfcShrink

(1)介紹

Adds shrunken log2 fold changes (LFC) and SE to a results table from DESeq run without LFC shrinkage. For consistency with results, the column name lfcSE is used here although what is returned is a posterior SD. Three shrinkage estimators for LFC are available via type (see the
vignette for more details on the estimators). The apeglm publication demonstrates that ’apeglm’ and ’ashr’ outperform the original ’normal’ shrinkage estimator.

The shrunken fold changes are useful for ranking genes by effect size and for visualization.
縮小的倍數(shù)變化有助于按效應(yīng)大小對(duì)基因進(jìn)行排序和可視化策添。
log2FC estimates do not account for the large dispersion we observe with low read counts.
log2 FC估計(jì)不能解釋我們?cè)诘蛂ead counts下觀察到的大的離散程度材部。
As with the shrinkage of dispersion estimates, LFC shrinkage uses information from all genes to generate more accurate estimates.
與估計(jì)離散程度的收縮一樣,LFC收縮使用來(lái)自所有基因的信息來(lái)生成更準(zhǔn)確的估計(jì)唯竹。
如果要根據(jù)LFC值提取差異基因乐导,需要shrunken values。另外浸颓,進(jìn)行功能分析例如GSEA時(shí)物臂,需要提供shrunken values。

(2)應(yīng)用

resultsNames(dds) 
[1] "Intercept"  "condition_treated_vs_untreated"
resLFC <- lfcShrink(dds, coef="condition_treated_vs_untreated", type="apeglm") 
#或
resLFC <- lfcShrink(dds, contrast = c("condition"," treated "," untreated "))
  • 選擇的apeglm參數(shù)進(jìn)行effect size shrinkage猾愿,改善了先前的估計(jì)鹦聪。resLFC相比res數(shù)據(jù)更加緊湊账阻。type可以選擇apeglm蒂秘、ashr等。
names(resLFC) 
[1] "baseMean"  "log2FoldChange"  "lfcSE"  "pvalue"  "padj"       
  • 與res相比少了"stat"一列淘太。這一步只會(huì)對(duì)LFC的值產(chǎn)生影響姻僧,p值是沒有改變的规丽,不會(huì)改變顯著差異的基因總數(shù)。

(3)用法

lfcShrink(dds, coef, contrast, res, type = c("normal", "apeglm", "ashr"), lfcThreshold = 0, svalue = FALSE, return List = FALSE, format = c("DataFrame", "GRanges", "GRangesList"), apeAdapt = TRUE, apeMethod = "nbinomCR", parallel = FALSE, BPPARAM = bpparam(), quiet = FALSE, ...)

  • coef
    the name or number of the coefficient (LFC) to shrink, consult resultsNames(dds) after running DESeq(dds). note: only coef or contrast can be specified, not both. apeglm requires use of coef. For normal, one of coef or contrast must be provided.(要收縮的系數(shù)的名稱或編號(hào)撇贺,通過(guò)resultsNames(dds)查看赌莺。coef或contrast二選一。apeglm需要coef松嘶。)
  • contrast
    see argument description in results. only coef or contrast can be specified, not both.
  • type
    "normal" is the original DESeq2 shrinkage estimator; "apeglm" is the adaptive t prior shrinkage estimator from the ’apeglm’ package; "ashr" is the adaptive shrinkage estimator from the ’ashr’ package, using a fitted mixture of normal prior.

8. 似然比檢驗(yàn)LRT

ddsLRT <- DESeq(dds, test="LRT", reduced=~1)
resLRT <- results(ddsLRT)

9. 開啟多線程

library("BiocParallel")
register(MulticoreParam(4))
#先預(yù)定4個(gè)核艘狭,等需要的時(shí)候直接使用parallel=TRUE來(lái)調(diào)用。

10. Plotting results

(1)樣本間關(guān)系熱圖(總體相似度)

library(pheatmap)
library(RColorBrewer)
rld
sampleDist <- dist(t(assay(rld)))  #樣品距離,歐氏距離,t轉(zhuǎn)置 
#為確保所有基因大致相同的contribution用rlog-transformed data 
#畫某些基因在樣本間的heatmap也可以用rlog數(shù)據(jù) 
#用PoiClaClu包計(jì)算泊松距離(Poisson Distance)翠订,必須是原始表達(dá)矩陣 
#poisd <- PoissonDistance(t(counts(dds))) 
sampleDistMatrix <- as.matrix(sampleDist)  #樣品間距離的矩陣
rownames(sampleDistMatrix) <- paste0(rld $cell,"-", rld$dex)
colnames(sampleDistMatrix) <- NULL
head(sampleDistMatrix)  #樣品間距離的數(shù)據(jù)框
colors <- colorRampPalette(rev(brewer.pal(9,"Blues")))(255)
pheatmap(sampleDistMatrix,
         clustering_distance_rows=sampleDist,
         clustering_distance_cols=sampleDist,
         color = colors)

(2)多維尺度分析(multidimensional scaling巢音,MDS)或主坐標(biāo)分析(principal coordinates analysis,PCoA)

library(ggplot2)
#把樣本之間的距離轉(zhuǎn)化成二維坐標(biāo),在降維過(guò)程中保證樣品點(diǎn)之間的距離不變
#MDS基于最小線性距離(歐氏距離)的聚類,與PCA的最大線性相關(guān)是一樣的
#適合在沒有表達(dá)矩陣值,但只有一個(gè)距離矩陣的情況下使用
mdsdata <- data.frame(cmdscale(sampleDistMatrix))
#cmdscale(classical multidimensional scaling)
mdsdata  #返回MDS的坐標(biāo)
mds <- cbind(mdsdata,as.data.frame(colData(vsd)))
mds  #按列合并
ggplot(data=mds,aes(X1,X2,color=cell,shape=dex)) +
  geom_point(size=3)

(3)主成分分析(Principal Component Analysis尽超,PCA)

pcadata <- plotPCA(vsd,intgroup = c("Batch","Group"), returnData=TRUE)
percentVar <- round(100*attr(pcadata,"percentVar"),1)
ggplot(pcadata, aes(PC1, PC2, color=Group, shape=Batch)) + 
  geom_point(size=3) +
  xlab(paste0("PC1: ",percentVar[1],"% variance")) +
  ylab(paste0("PC2: ",percentVar[2],"% variance")) +
  geom_text_repel(aes(PC1, PC2,color=Group,label=colnames(vsd)),size=3) +
  theme_bw()

(4)plotCounts()函數(shù)查看特定基因的表達(dá)量

topGene <- rownames(res)[which.min(res$padj)]  #padj最小的一個(gè)基因
plotCounts(dds, gene=topGene, intgroup=c("dex"))  #畫出這個(gè)基因的標(biāo)準(zhǔn)化后的表達(dá)量
#以散點(diǎn)圖的形式畫出這個(gè)基因在各樣本中的表達(dá)量
data <- plotCounts(dds, gene=topGene, intgroup=c("dex","cell"), returnData=TRUE)
ggplot(data, aes(x=dex, y=count, color=cell)) +  
 scale_y_log10() +
 geom_point(position=position_jitter(width=.1,height=0), size=3)  
ggplot(data, aes(x=dex, y=count, fill=dex)) +
 scale_y_log10() +
 geom_dotplot(binaxis="y", stackdir="center")
ggplot(data, aes(x=dex, y=count, color=cell, group=cell)) +
 scale_y_log10() + geom_point(size=3) + geom_line()

(5)MA圖

plotMA(res, ylim=c(-5,5))
#"M" for minus(減), because a log ratio is equal to log minus log, and "A" for average(均值)
#M對(duì)應(yīng)差異對(duì)比組之間基因表達(dá)變化log2 fold changes (Y軸)
#A對(duì)應(yīng)差異對(duì)比組基因表達(dá)量均值the mean of normalized counts (X軸)
plotMA(res, alpha = 0.1, main = "", xlab = "mean of normalized counts", ylim=c(-5,5))
#alpha為padj顯著性水平閾值官撼,默認(rèn)alpha=0.1
#Each gene is represented with a dot. Genes with an adjusted p value below a threshold (here 0.1, the default) are shown in red.
plotMA(resLFC, ylim=c(-5,5))  #提高了log2 fold change閾值
topGene <- rownames(resLFC)[which.min(resLFC$padj)]
with(resLFC[topGene, ], {
 points(baseMean, log2FoldChange, col="dodgerblue", cex=2, lwd=2)
 text(baseMean, log2FoldChange, topGene, pos=2, col="dodgerblue")
})  #標(biāo)記出一個(gè)特定的基因

(6)Removing hidden batch effects

library("sva")
dat <- counts(dds, normalized=TRUE)
idx <- rowMeans(dat) > 1
dat <- dat[idx,]
mod <- model.matrix(~ dex, colData(dds))
mod0 <- model.matrix(~ 1, colData(dds))
svseq <- svaseq(dat, mod, mod0, n.sv=2)
svseq$sv
par(mfrow=c(2,1),mar=c(3,5,3,1))
stripchart(svseq$sv[,1] ~ dds$cell,vertical=TRUE,main="SV1")
abline(h=0)
stripchart(svseq$sv[,2] ~ dds$cell,vertical=TRUE,main="SV2")
abline(h=0)
ddssva <- dds
ddssva$SV1 <- svseq$sv[,1]
ddssva$SV2 <- svseq$sv[,2]
design(ddssva) <- ~ SV1 + SV2 + dex
ddssva <- DESeq(ddssva)
最后編輯于
?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請(qǐng)聯(lián)系作者
  • 序言:七十年代末,一起剝皮案震驚了整個(gè)濱河市似谁,隨后出現(xiàn)的幾起案子傲绣,更是在濱河造成了極大的恐慌,老刑警劉巖巩踏,帶你破解...
    沈念sama閱讀 218,941評(píng)論 6 508
  • 序言:濱河連續(xù)發(fā)生了三起死亡事件秃诵,死亡現(xiàn)場(chǎng)離奇詭異,居然都是意外死亡塞琼,警方通過(guò)查閱死者的電腦和手機(jī)顷链,發(fā)現(xiàn)死者居然都...
    沈念sama閱讀 93,397評(píng)論 3 395
  • 文/潘曉璐 我一進(jìn)店門,熙熙樓的掌柜王于貴愁眉苦臉地迎上來(lái)屈梁,“玉大人嗤练,你說(shuō)我怎么就攤上這事≡谘龋” “怎么了煞抬?”我有些...
    開封第一講書人閱讀 165,345評(píng)論 0 356
  • 文/不壞的土叔 我叫張陵,是天一觀的道長(zhǎng)构哺。 經(jīng)常有香客問我革答,道長(zhǎng),這世上最難降的妖魔是什么曙强? 我笑而不...
    開封第一講書人閱讀 58,851評(píng)論 1 295
  • 正文 為了忘掉前任残拐,我火速辦了婚禮,結(jié)果婚禮上碟嘴,老公的妹妹穿的比我還像新娘溪食。我一直安慰自己,他們只是感情好娜扇,可當(dāng)我...
    茶點(diǎn)故事閱讀 67,868評(píng)論 6 392
  • 文/花漫 我一把揭開白布错沃。 她就那樣靜靜地躺著栅组,像睡著了一般。 火紅的嫁衣襯著肌膚如雪枢析。 梳的紋絲不亂的頭發(fā)上玉掸,一...
    開封第一講書人閱讀 51,688評(píng)論 1 305
  • 那天,我揣著相機(jī)與錄音醒叁,去河邊找鬼司浪。 笑死,一個(gè)胖子當(dāng)著我的面吹牛把沼,可吹牛的內(nèi)容都是我干的断傲。 我是一名探鬼主播,決...
    沈念sama閱讀 40,414評(píng)論 3 418
  • 文/蒼蘭香墨 我猛地睜開眼智政,長(zhǎng)吁一口氣:“原來(lái)是場(chǎng)噩夢(mèng)啊……” “哼认罩!你這毒婦竟也來(lái)了?” 一聲冷哼從身側(cè)響起续捂,我...
    開封第一講書人閱讀 39,319評(píng)論 0 276
  • 序言:老撾萬(wàn)榮一對(duì)情侶失蹤垦垂,失蹤者是張志新(化名)和其女友劉穎,沒想到半個(gè)月后牙瓢,有當(dāng)?shù)厝嗽跇淞掷锇l(fā)現(xiàn)了一具尸體劫拗,經(jīng)...
    沈念sama閱讀 45,775評(píng)論 1 315
  • 正文 獨(dú)居荒郊野嶺守林人離奇死亡,尸身上長(zhǎng)有42處帶血的膿包…… 初始之章·張勛 以下內(nèi)容為張勛視角 年9月15日...
    茶點(diǎn)故事閱讀 37,945評(píng)論 3 336
  • 正文 我和宋清朗相戀三年矾克,在試婚紗的時(shí)候發(fā)現(xiàn)自己被綠了页慷。 大學(xué)時(shí)的朋友給我發(fā)了我未婚夫和他白月光在一起吃飯的照片。...
    茶點(diǎn)故事閱讀 40,096評(píng)論 1 350
  • 序言:一個(gè)原本活蹦亂跳的男人離奇死亡胁附,死狀恐怖酒繁,靈堂內(nèi)的尸體忽然破棺而出,到底是詐尸還是另有隱情控妻,我是刑警寧澤州袒,帶...
    沈念sama閱讀 35,789評(píng)論 5 346
  • 正文 年R本政府宣布,位于F島的核電站弓候,受9級(jí)特大地震影響郎哭,放射性物質(zhì)發(fā)生泄漏。R本人自食惡果不足惜菇存,卻給世界環(huán)境...
    茶點(diǎn)故事閱讀 41,437評(píng)論 3 331
  • 文/蒙蒙 一夸研、第九天 我趴在偏房一處隱蔽的房頂上張望。 院中可真熱鬧依鸥,春花似錦亥至、人聲如沸。這莊子的主人今日做“春日...
    開封第一講書人閱讀 31,993評(píng)論 0 22
  • 文/蒼蘭香墨 我抬頭看了看天上的太陽(yáng)井辆。三九已至关筒,卻和暖如春溶握,著一層夾襖步出監(jiān)牢的瞬間,已是汗流浹背蒸播。 一陣腳步聲響...
    開封第一講書人閱讀 33,107評(píng)論 1 271
  • 我被黑心中介騙來(lái)泰國(guó)打工睡榆, 沒想到剛下飛機(jī)就差點(diǎn)兒被人妖公主榨干…… 1. 我叫王不留,地道東北人袍榆。 一個(gè)月前我還...
    沈念sama閱讀 48,308評(píng)論 3 372
  • 正文 我出身青樓胀屿,卻偏偏與公主長(zhǎng)得像,于是被迫代替她去往敵國(guó)和親包雀。 傳聞我的和親對(duì)象是個(gè)殘疾皇子宿崭,可洞房花燭夜當(dāng)晚...
    茶點(diǎn)故事閱讀 45,037評(píng)論 2 355

推薦閱讀更多精彩內(nèi)容