DESeq2差異基因分析和批次效應(yīng)移除

差異基因鑒定

基因表達(dá)標(biāo)準(zhǔn)化

不同樣品的測(cè)序量會(huì)有差異，最簡(jiǎn)單的標(biāo)準(zhǔn)化方式是計(jì)算
counts per million (CPM)狐血，即原始reads count除以總reads數(shù)乘以1,000,000疏唾。

這種計(jì)算方式的缺點(diǎn)是容易受到極高表達(dá)且在不同樣品中存在差異表達(dá)的基因的影響冠跷；這些基因的打開(kāi)或關(guān)閉會(huì)影響到細(xì)胞中總的分子數(shù)目荡含，可能導(dǎo)致這些基因標(biāo)準(zhǔn)化之后就不存在表達(dá)差異了火窒，而原本沒(méi)有差異的基因標(biāo)準(zhǔn)化之后卻有差異了零酪。

RPKM冒嫡、FPKM和TPM是CPM按照基因或轉(zhuǎn)錄本長(zhǎng)度歸一化后的表達(dá)，也會(huì)受到這一影響四苇。

calc_cpm <- function (expr_mat, spikes = NULL){
  norm_factor <- colSums(expr_mat[-spikes,])
  return(t(t(expr_mat)/norm_factor)) * 10^6
}

為了解決這一問(wèn)題孝凌，研究者提出了其它的標(biāo)準(zhǔn)化方法。

量化因子 (size factor,SF)是由DESeq提出的月腋。其方法是首先計(jì)算每個(gè)基因在所有樣品中表達(dá)的幾何平均值蟀架。每個(gè)細(xì)胞的量化因子(size
factor)是所有基因與其在所有樣品中的表達(dá)值的幾何平均值的比值的中位數(shù)。由于幾何平均值的使用榆骚，只有在所有樣品中表達(dá)都不為0的基因才能用來(lái)計(jì)算片拍。這一方法又被稱為RLE (relative log expression)。

calc_sf <- function (expr_mat, spikes=NULL){
  geomeans <- exp(rowMeans(log(expr_mat[-spikes,])))
  SF <- function(cnts){
    median((cnts/geomeans)[(is.finite(geomeans) & geomeans >0)])
  }
  norm_factor <- apply(expr_mat[-spikes,],2,SF)
  return(t(t(expr_mat)/norm_factor))
}

上四分位數(shù) (upperquartile,UQ)是樣品中所有基因的表達(dá)除以處于上四分位數(shù)的基因的表達(dá)值妓肢。同時(shí)為了保證表達(dá)水平的相對(duì)穩(wěn)定捌省，計(jì)算得到的上四分位數(shù)值要除以所有樣品中上四分位數(shù)值的中位數(shù)。

calc_uq <- function (expr_mat, spikes=NULL){
  UQ <- function(x) {
    quantile(x[x>0],0.75)
  }
  uq <- unlist(apply(expr_mat[-spikes,],2,UQ))
  norm_factor <- uq/median(uq)
  return(t(t(expr_mat)/norm_factor))
}

TMM是M-值的加權(quán)截尾均值碉钠。選定一個(gè)樣品為參照佑钾，其它樣品中基因的表達(dá)相對(duì)于參照樣品中對(duì)應(yīng)基因表達(dá)倍數(shù)的log2值定義為M-值奄抽。隨后去除M-值中最高和最低的30%除秀，剩下的M值計(jì)算加權(quán)平均值纺蛆。每一個(gè)非參照樣品的基因表達(dá)值都乘以計(jì)算出的TMM。這個(gè)方法假設(shè)大部分基因的表達(dá)是沒(méi)有差異的污筷。

DESeq2 差異基因鑒定一步法

為了簡(jiǎn)化差異基因的運(yùn)算工闺，易生信做了腳本封裝，DESeq2.sh瓣蛀，只需提供原始基因表達(dá)矩陣陆蟆、樣品分組信息表即可進(jìn)行差異基因分析和鑒定。

DESeq2.sh
OPTIONS:
    -f  Data file [A gene count matrix, NECESSARY]
        CHECK ABOVE FOR DETAILS
        
    -s  Sample file [A multiple columns file with header line, 
        For <timeseries>,  one <conditions> columns is needed.
        NECESSARY]
        CHECK ABOVE FOR DETAILS
        
    -d  The design formula for DESeqDataSetFromMatrix.
        [Default <conditions>, 
        accept <cell+time+cell:time> for example 2.]
        
    -D  The reduced design formula for DESeq.
        [Only applicable to <timeseries> analysis, 
        accept <cell+time> or <time> or <cell> for example 2.]
        
    -m  Specify the comparasion mode.
        [Default <pairwise>, accept <timeseries>,
        <pairwise> comparasion will still be done in <timeseries>
        mode.
        NECESSARY]
        
    -p  A file containing the pairs needed to do comparasion. 
        CHECK ABOVE FOR DETAILS
        All samples will be compared in <pairwise> mode if not specified here.
    
    -F  Log2 Fold change for screening DE genes.
        Default 1

    -P  FDR for screening DE genes.
        Default 0.01

    -q  FDR for screening time-series DE genes.
        Default 0.1

    -e  Execute programs
        Default TRUE

    -i  Install packages of not exist. 
        Default FALSE

Eg.
    DESeq2.sh -f matirx -s sample -d conditions
    DESeq2.sh -f matirx -s sample -p compare_pair

具體使用

cd ~/data
# ehbio_trans.Count_matrix.xls和sampleFile都是前面用到的文件
DESeq2.sh -f ehbio_trans.Count_matrix.xls -s sampleFile

[1] "Perform pairwise comparasion using <design=~conditions>"
estimating size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
[1] "Output normalized counts"
[1] "Output rlog transformed normalized counts"
[1] "Performing sample clustering"
null device 
          1 
[1] "DE genes between untrt trt"
[1] "PCA analysis"
null device

運(yùn)行結(jié)束即可獲得以下結(jié)果文件

# 具體的文件內(nèi)容和圖的樣式見(jiàn)后面的分步法文檔
# 原始輸入文件
ehbio_trans.Count_matrix.xls
sampleFile

# 所有差異基因列表
ehbio_trans.Count_matrix.xls.DESeq2.all.DE

# PCA結(jié)果
ehbio_trans.Count_matrix.xls.DESeq2.normalized.rlog.pca.pdf

# 樣品相關(guān)性層級(jí)聚類結(jié)果
ehbio_trans.Count_matrix.xls.DESeq2.normalized.rlog.pearson.pdf

# rlog轉(zhuǎn)換后的標(biāo)準(zhǔn)化后的表達(dá)結(jié)果
ehbio_trans.Count_matrix.xls.DESeq2.normalized.rlog.xls

# 標(biāo)準(zhǔn)化后的表達(dá)結(jié)果
ehbio_trans.Count_matrix.xls.DESeq2.normalized.xls

# 運(yùn)行腳本
ehbio_trans.Count_matrix.xls.DESeq2.r

# 差異基因結(jié)果
ehbio_trans.Count_matrix.xls.DESeq2.untrt._higherThan_.trt.id.xls
ehbio_trans.Count_matrix.xls.DESeq2.untrt._higherThan_.trt.xls
ehbio_trans.Count_matrix.xls.DESeq2.untrt._lowerThan_.trt.id.xls
ehbio_trans.Count_matrix.xls.DESeq2.untrt._lowerThan_.trt.xls

# 火山圖和火山圖輸入數(shù)據(jù)
ehbio_trans.Count_matrix.xls.DESeq2.untrt._vs_.trt.results.xls
ehbio_trans.Count_matrix.xls.DESeq2.untrt._vs_.trt.results.xls.volcano.pdf

一步法腳本可在https://ke.qq.com/course/301387?tuin=20cd7788或點(diǎn)擊閱讀原文獲得揪惦。

DESeq2 差異基因鑒定分步法

安裝包 `DESeq2`安裝方法如下

source("https://bioconductor.org/biocLite.R")
biocLite('BiocInstaller')
biocLite("DESeq2")
install.packages(c("gplots", "amap", "ggplot2"))

A distributional assumption is needed because we want to estimate the
probability of extreme events (large fold change just appearing by
chance) from limited replicates. The negative binomial (a.k.a.
Gamma-Poisson) is a good choice for RNA-seq experiments because

The observed data at gene level is inherently counts or estimated
counts of fragments for each feature and
The spread of values among biological replicates is more than given
by a simpler, one parameter distribution, the Poisson; and it seems
to be captured by the NB sufficiently well

加載包

library(DESeq2)

## Loading required package: S4Vectors

## Loading required package: stats4

## Loading required package: BiocGenerics

## Loading required package: parallel

## 
## Attaching package: 'BiocGenerics'

## The following objects are masked from 'package:parallel':
## 
##     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
##     clusterExport, clusterMap, parApply, parCapply, parLapply,
##     parLapplyLB, parRapply, parSapply, parSapplyLB

## The following objects are masked from 'package:stats':
## 
##     IQR, mad, sd, var, xtabs

## The following objects are masked from 'package:base':
## 
##     anyDuplicated, append, as.data.frame, cbind, colMeans,
##     colnames, colSums, do.call, duplicated, eval, evalq, Filter,
##     Find, get, grep, grepl, intersect, is.unsorted, lapply,
##     lengths, Map, mapply, match, mget, order, paste, pmax,
##     pmax.int, pmin, pmin.int, Position, rank, rbind, Reduce,
##     rowMeans, rownames, rowSums, sapply, setdiff, sort, table,
##     tapply, union, unique, unsplit, which, which.max, which.min

## 
## Attaching package: 'S4Vectors'

## The following object is masked from 'package:base':
## 
##     expand.grid

## Loading required package: IRanges

## Loading required package: GenomicRanges

## Loading required package: GenomeInfoDb

## Loading required package: SummarizedExperiment

## Loading required package: Biobase

## Welcome to Bioconductor
## 
##     Vignettes contain introductory material; view with
##     'browseVignettes()'. To cite Bioconductor, see
##     'citation("Biobase")', and for packages 'citation("pkgname")'.

## Loading required package: DelayedArray

## Loading required package: matrixStats

## 
## Attaching package: 'matrixStats'

## The following objects are masked from 'package:Biobase':
## 
##     anyMissing, rowMedians

## 
## Attaching package: 'DelayedArray'

## The following objects are masked from 'package:matrixStats':
## 
##     colMaxs, colMins, colRanges, rowMaxs, rowMins, rowRanges

## The following object is masked from 'package:base':
## 
##     apply

library("RColorBrewer")
library("gplots")

## 
## Attaching package: 'gplots'

## The following object is masked from 'package:IRanges':
## 
##     space

## The following object is masked from 'package:S4Vectors':
## 
##     space

## The following object is masked from 'package:stats':
## 
##     lowess

library("amap")
library("ggplot2")
library("BiocParallel")

# 多線程加速, 使用4個(gè)核
# 如果你電腦性能強(qiáng)大遍搞，可以把這個(gè)數(shù)變大
register(MulticoreParam(4))

讀入數(shù)據(jù)

# 注意文件的位置，程序自己不知道文件在什么地方
# 如果只給文件名器腋，不給路徑溪猿，程序只會(huì)在當(dāng)前目錄查找文件
# 若找不到钩杰，則會(huì)報(bào)錯(cuò)
# 可以使用下面命令設(shè)置工作目錄到文件所在目錄
# 或提供文件全路徑
# setwd("~/data")
data <- read.table("ehbio_trans.Count_matrix.xls", header=T, row.names=1, com='', quote='',
    check.names=F, sep="\t")

# 撇掉在多于兩個(gè)樣本中count<1的值，如果樣本數(shù)多诊县，這個(gè)數(shù)值可以適當(dāng)增加
# 排除極低表達(dá)基因的干擾
data <- data[rowSums(data)>2,]
head(data)

##                 untrt_N61311 untrt_N052611 untrt_N080611 untrt_N061011
## ENSG00000227232           13            25            23            24
## ENSG00000278267            0             5             3             4
## ENSG00000268903            0             2             0             0
## ENSG00000269981            0             3             0             1
## ENSG00000241860            3            11             1             5
## ENSG00000279457           46            90            73            49
##                 trt_N61311 trt_N052611 trt_N080611 trt_N061011
## ENSG00000227232         12          12          22          22
## ENSG00000278267          2           4           3           1
## ENSG00000268903          0           2           0           3
## ENSG00000269981          0           2           0           0
## ENSG00000241860          3           2           0           2
## ENSG00000279457         52          46          89          31

# 讀入分組信息
sample <- read.table("sampleFile", header=T, row.names=1, com='',
    quote='', check.names=F, sep="\t", colClasses="factor")

sample <- sample[match(colnames(data), rownames(sample)),, drop=F]
sample_rowname <- rownames(sample)

# 下面的可以忽略讲弄，如果沒(méi)遇到錯(cuò)誤不需要執(zhí)行
# 目的是做因子轉(zhuǎn)換
sample <- data.frame(lapply(sample, function(x) factor(x, levels=unique(x))))
rownames(sample) <- sample_rowname
sample

##               conditions
## untrt_N61311       untrt
## untrt_N052611      untrt
## untrt_N080611      untrt
## untrt_N061011      untrt
## trt_N61311           trt
## trt_N052611          trt
## trt_N080611          trt
## trt_N061011          trt

產(chǎn)生DESeq數(shù)據(jù)集

DESeq2包采用DESeqDataSet存儲(chǔ)原始的read count和中間計(jì)算的統(tǒng)計(jì)量。

每個(gè)DESeqDataSet對(duì)象都要有一個(gè)實(shí)驗(yàn)設(shè)計(jì)formula依痊，用于對(duì)數(shù)據(jù)進(jìn)行分組避除，以便計(jì)算表達(dá)值的離散度和估計(jì)表達(dá)倍數(shù)差異，通常格式為~ batch + conditions
(為了方便后續(xù)計(jì)算胸嘁，最為關(guān)注的分組信息放在最后一位)瓶摆。

countData: 表達(dá)矩陣

colData: 樣品分組信息表

design: 實(shí)驗(yàn)設(shè)計(jì)信息，conditions必須是colData中的一列

ddsFullCountTable <- DESeqDataSetFromMatrix(countData = data,
        colData = sample,  design= ~ conditions)

dds <- DESeq(ddsFullCountTable)

## estimating size factors

## estimating dispersions

## gene-wise dispersion estimates

## mean-dispersion relationship

## final dispersion estimates

## fitting model and testing

獲取標(biāo)準(zhǔn)化后的數(shù)據(jù)

# ?counts查看此函數(shù)功能
# normalized=T, 返回標(biāo)準(zhǔn)化的數(shù)據(jù)
normalized_counts <- counts(dds, normalized=TRUE)
head(normalized_counts)

##                 untrt_N61311 untrt_N052611 untrt_N080611 untrt_N061011
## ENSG00000227232    12.730963     21.179286    19.4979982     25.994727
## ENSG00000278267     0.000000      4.235857     2.5432172      4.332454
## ENSG00000268903     0.000000      1.694343     0.0000000      0.000000
## ENSG00000269981     0.000000      2.541514     0.0000000      1.083114
## ENSG00000241860     2.937915      9.318886     0.8477391      5.415568
## ENSG00000279457    45.048024     76.245430    61.8849507     53.072567
##                 trt_N61311 trt_N052611 trt_N080611 trt_N061011
## ENSG00000227232  13.423907   17.885811   15.750664   23.250144
## ENSG00000278267   2.237318    5.961937    2.147818    1.056825
## ENSG00000268903   0.000000    2.980969    0.000000    3.170474
## ENSG00000269981   0.000000    2.980969    0.000000    0.000000
## ENSG00000241860   3.355977    2.980969    0.000000    2.113649
## ENSG00000279457  58.170264   68.562276   63.718596   32.761567

根據(jù)基因在不同的樣本中表達(dá)變化的差異程度mad值對(duì)數(shù)據(jù)排序性宏，差異越大的基因排位越前群井。

normalized_counts_mad <- apply(normalized_counts, 1, mad)
normalized_counts <- normalized_counts[order(normalized_counts_mad, decreasing=T), ]

# 標(biāo)準(zhǔn)化后的數(shù)據(jù)輸出
write.table(normalized_counts, file="ehbio_trans.Count_matrix.xls.DESeq2.normalized.xls",
quote=F, sep="\t", row.names=T, col.names=T)

# 只在Linux下能運(yùn)行，目的是填補(bǔ)表格左上角內(nèi)容
system(paste("sed -i '1 s/^/ID\t/'", "ehbio_trans.Count_matrix.xls.DESeq2.normalized.xls"))

# log轉(zhuǎn)換后的結(jié)果
rld <- rlog(dds, blind=FALSE)
rlogMat <- assay(rld)
rlogMat <- rlogMat[order(normalized_counts_mad, decreasing=T), ]


write.table(rlogMat, file="ehbio_trans.Count_matrix.xls.DESeq2.normalized.rlog.xls",
quote=F, sep="\t", row.names=T, col.names=T)

# 只在Linux下能運(yùn)行毫胜，目的是填補(bǔ)表格左上角內(nèi)容
system(paste("sed -i '1 s/^/ID\t/'", "ehbio_trans.Count_matrix.xls.DESeq2.normalized.rlog.xls"))

樣品層級(jí)聚類分析书斜，判斷樣品的相似性和組間組內(nèi)差異

# 生成顏色
hmcol <- colorRampPalette(brewer.pal(9, "GnBu"))(100)

# 計(jì)算相關(guān)性pearson correlation
pearson_cor <- as.matrix(cor(rlogMat, method="pearson"))

# 層級(jí)聚類
hc <- hcluster(t(rlogMat), method="pearson")

# 熱圖繪制
## 在命令行下運(yùn)行時(shí)，需要去除下面開(kāi)頭的#號(hào)酵使，把輸出的圖保存到文件中
## 輸出到文件時(shí)荐吉，dev.off()命令是關(guān)閉輸出，從而完成圖片的寫(xiě)入口渔。如果不做這一步样屠，則圖片則不能寫(xiě)入，從而不能打開(kāi)
## 如果在Rstudio或其它可視化界面時(shí)搓劫，可以直接把圖輸出到屏幕
#pdf("ehbio_trans.Count_matrix.xls.DESeq2.normalized.rlog.pearson.pdf", pointsize=10)
heatmap.2(pearson_cor, Rowv=as.dendrogram(hc), symm=T, trace="none",
col=hmcol, margins=c(11,11), main="The pearson correlation of each
sample")

image

#dev.off()

樣品PCA分析

pca_data <- plotPCA(rld, intgroup=c("conditions"), returnData=T, ntop=5000)

差異基因分析

# 定義變量的好處是下面的代碼就獨(dú)立了
# 如果想比較不同的組瞧哟，只需在這修改即可混巧，后面代碼都可以不動(dòng)
sampleA = "untrt"
sampleB = "trt"

results函數(shù)提取差異基因分析結(jié)果枪向，包含log2 fold changes,
p values和adjusted p values.
constrast用于指定比較的兩組的信息，輸出的log2FoldChange為log2(SampleA/SampleB)咧党。

contrastV <- c("conditions", sampleA, sampleB)
res <- results(dds,  contrast=contrastV)
res

## log2 fold change (MLE): conditions untrt vs trt 
## Wald test p-value: conditions untrt vs trt 
## DataFrame with 26280 rows and 6 columns
##                   baseMean log2FoldChange     lfcSE        stat    pvalue
##                  <numeric>      <numeric> <numeric>   <numeric> <numeric>
## ENSG00000227232 18.7141876     0.17695376 0.3809406  0.46451794 0.6422767
## ENSG00000278267  2.8144283     0.02802497 1.0249422  0.02734297 0.9781862
## ENSG00000268903  0.9807232    -1.71669654 2.3186888 -0.74037385 0.4590732
## ENSG00000269981  0.8256996     0.59077960 2.4538911  0.24075217 0.8097472
## ENSG00000241860  3.3713378     1.21773503 1.0789466  1.12863330 0.2590526
## ...                    ...            ...       ...         ...       ...
## 29256            0.3507226      1.8471273  3.484683  0.53007036 0.5960631
## 36308            0.9279831      0.1394489  1.598343  0.08724589 0.9304761
## 8414             0.7459738     -0.2678522  2.053434 -0.13044109 0.8962175
## 28837            1.2626343      0.2689443  1.405958  0.19128899 0.8482992
## 35320            1.0041030      0.5493562  1.717832  0.31979625 0.7491228
##                      padj
##                 <numeric>
## ENSG00000227232 0.8538515
## ENSG00000278267        NA
## ENSG00000268903        NA
## ENSG00000269981        NA
## ENSG00000241860        NA
## ...                   ...
## 29256                  NA
## 36308                  NA
## 8414                   NA
## 28837                  NA
## 35320                  NA

給DESeq2的原始輸出結(jié)果增加樣品平均表達(dá)信息秘蛔，使得結(jié)果更容易理解和解析。

# 獲得第一組數(shù)據(jù)均值
baseA <- counts(dds, normalized=TRUE)[, colData(dds)$conditions == sampleA]
    
if (is.vector(baseA)){
    baseMeanA <- as.data.frame(baseA)
} else {
    baseMeanA <- as.data.frame(rowMeans(baseA))
}
colnames(baseMeanA) <- sampleA
head(baseMeanA)

##                      untrt
## ENSG00000227232 19.8507436
## ENSG00000278267  2.7778822
## ENSG00000268903  0.4235857
## ENSG00000269981  0.9061570
## ENSG00000241860  4.6300269
## ENSG00000279457 59.0627429

# 獲得第二組數(shù)據(jù)均值
baseB <- counts(dds, normalized=TRUE)[, colData(dds)$conditions == sampleB]
if (is.vector(baseB)){
        baseMeanB <- as.data.frame(baseB)
} else {
        baseMeanB <- as.data.frame(rowMeans(baseB))
}
colnames(baseMeanB) <- sampleB
head(baseMeanB)

##                        trt
## ENSG00000227232 17.5776316
## ENSG00000278267  2.8509744
## ENSG00000268903  1.5378607
## ENSG00000269981  0.7452421
## ENSG00000241860  2.1126487
## ENSG00000279457 55.8031756

# 結(jié)果組合
res <- cbind(baseMeanA, baseMeanB, as.data.frame(res))
head(res)

##                      untrt        trt   baseMean log2FoldChange     lfcSE
## ENSG00000227232 19.8507436 17.5776316 18.7141876     0.17695376 0.3809406
## ENSG00000278267  2.7778822  2.8509744  2.8144283     0.02802497 1.0249422
## ENSG00000268903  0.4235857  1.5378607  0.9807232    -1.71669654 2.3186888
## ENSG00000269981  0.9061570  0.7452421  0.8256996     0.59077960 2.4538911
## ENSG00000241860  4.6300269  2.1126487  3.3713378     1.21773503 1.0789466
## ENSG00000279457 59.0627429 55.8031756 57.4329592     0.08926886 0.2929300
##                        stat    pvalue      padj
## ENSG00000227232  0.46451794 0.6422767 0.8538515
## ENSG00000278267  0.02734297 0.9781862        NA
## ENSG00000268903 -0.74037385 0.4590732        NA
## ENSG00000269981  0.24075217 0.8097472        NA
## ENSG00000241860  1.12863330 0.2590526        NA
## ENSG00000279457  0.30474470 0.7605606 0.9130580

# 增加ID信息
res <- cbind(ID=rownames(res), as.data.frame(res))
res$baseMean <- rowMeans(cbind(baseA, baseB))

# 校正后p-value為NA的復(fù)制為1
res$padj[is.na(res$padj)] <- 1

# 按pvalue排序, 把差異大的基因放前面
res <- res[order(res$pvalue),]
head(res)

##                              ID     untrt       trt   baseMean
## ENSG00000152583 ENSG00000152583   77.8512  1900.412   989.1314
## ENSG00000148175 ENSG00000148175 7233.0010 19483.552 13358.2766
## ENSG00000179094 ENSG00000179094  151.7491  1380.881   766.3151
## ENSG00000134686 ENSG00000134686 1554.0390  4082.440  2818.2395
## ENSG00000125148 ENSG00000125148 1298.0392  5958.946  3628.4926
## ENSG00000120129 ENSG00000120129  773.9769  5975.522  3374.7494
##                 log2FoldChange      lfcSE      stat        pvalue
## ENSG00000152583      -4.608311 0.21090819 -21.84984 7.800059e-106
## ENSG00000148175      -1.429585 0.08639329 -16.54741  1.671380e-61
## ENSG00000179094      -3.184674 0.20065896 -15.87108  1.005030e-56
## ENSG00000134686      -1.392749 0.09190323 -15.15451  7.073605e-52
## ENSG00000125148      -2.199191 0.14796048 -14.86337  5.698869e-50
## ENSG00000120129      -2.948122 0.19931242 -14.79146  1.663007e-49
##                          padj
## ENSG00000152583 1.331002e-101
## ENSG00000148175  1.426021e-57
## ENSG00000179094  5.716612e-53
## ENSG00000134686  3.017600e-48
## ENSG00000125148  1.944910e-46
## ENSG00000120129  4.729591e-46

整體分析結(jié)果輸出到文件

comp314 <- paste(sampleA, "_vs_", sampleB, sep=".")

# 生成文件名
file_base <- paste("ehbio_trans.Count_matrix.xls.DESeq2", comp314, sep=".")
file_base1 <- paste(file_base, "results.xls", sep=".")
write.table(as.data.frame(res), file=file_base1, sep="\t", quote=F, row.names=F)

提取差異表達(dá)基因

# 差異基因篩選傍衡，padj<0.1
res_de <- subset(res, res$padj<0.1, select=c('ID', sampleA,
        sampleB, 'log2FoldChange', 'padj'))
# foldchang > 1
res_de_up <- subset(res_de, res_de$log2FoldChange>=1)

file <- paste("ehbio_trans.Count_matrix.xls.DESeq2",sampleA,"_higherThan_",sampleB,
        'xls', sep=".") 
write.table(as.data.frame(res_de_up), file=file, sep="\t", quote=F, row.names=F)

res_de_dw <- subset(res_de, res_de$log2FoldChange<=(-1)*1)

file <- paste("ehbio_trans.Count_matrix.xls.DESeq2",sampleA, "_lowerThan_", sampleB, 
        'xls', sep=".") 
write.table(as.data.frame(res_de_dw), file=file, sep="\t", quote=F, row.names=F)

# 差異基因ID
res_de_up_id = data.frame(ID=res_de_up$ID, 
        type=paste(sampleA,"_higherThan_", sampleB, sep="."))
res_de_dw_id = data.frame(ID=res_de_dw$ID, 
        type=paste(sampleA,"_lowerThan_", sampleB, sep="."))
de_id = rbind(res_de_up_id, res_de_dw_id)

file <- "ehbio_trans.Count_matrix.xls.DESeq2.all.DE"
write.table(as.data.frame(de_id), file=file, sep="\t", quote=F, row.names=F, col.names=F)

繪制火山圖

# 可以把前面生成的results.xls文件提交到www.ehbio.com/ImageGP繪制火山圖
# 或者使用我們的s-plot https://github.com/Tong-Chen/s-plot
logCounts <- log2(res$baseMean+1)
logFC <- res$log2FoldChange
FDR <- res$padj
#png(filename=paste(file_base, "Volcano.png", sep="."))
plot(logFC, -1*log10(FDR), col=ifelse(FDR<=0.01, "red", "black"),
 xlab="logFC", ylab="-1*log1o(FDR)", main="Volcano plot", pch=".")
#dev.off()

image

差異基因熱圖

res_de_up_top20_id <- as.vector(head(res_de_up$ID,20))
res_de_dw_top20_id <- as.vector(head(res_de_dw$ID,20))

red_de_top20 <- c(res_de_up_top20_id, res_de_dw_top20_id)
red_de_top20

##  [1] "ENSG00000178695" "ENSG00000196517" "ENSG00000116584"
##  [4] "ENSG00000144369" "ENSG00000276600" "ENSG00000108821"
##  [7] "ENSG00000162692" "ENSG00000181467" "ENSG00000145777"
## [10] "ENSG00000163491" "ENSG00000183160" "ENSG00000172986"
## [13] "ENSG00000164484" "ENSG00000131389" "ENSG00000134253"
## [16] "ENSG00000124766" "ENSG00000227051" "ENSG00000146006"
## [19] "ENSG00000112837" "ENSG00000146592" "ENSG00000152583"
## [22] "ENSG00000148175" "ENSG00000179094" "ENSG00000134686"
## [25] "ENSG00000125148" "ENSG00000120129" "ENSG00000189221"
## [28] "ENSG00000109906" "ENSG00000101347" "ENSG00000162614"
## [31] "ENSG00000096060" "ENSG00000162616" "ENSG00000166741"
## [34] "ENSG00000183044" "ENSG00000164125" "ENSG00000198624"
## [37] "ENSG00000256235" "ENSG00000077684" "ENSG00000135821"
## [40] "ENSG00000164647"

# 可以把矩陣存儲(chǔ)深员，提交到www.ehbio.com/ImageGP繪制火山圖
# 或者使用我們的s-plot https://github.com/Tong-Chen/s-plot
red_de_top20_expr <- normalized_counts[rownames(normalized_counts) %in% red_de_top20,]
library(pheatmap)
pheatmap(red_de_top20_expr, cluster_row=T, scale="row", annotation_col=sample)

image

關(guān)于批次效應(yīng)

見(jiàn)論壇討論 http://www.ehbio.com/Esx/d/10-deseq2。

每個(gè)DESeqDataSet對(duì)象都要有一個(gè)實(shí)驗(yàn)設(shè)計(jì)formula蛙埂，用于對(duì)數(shù)據(jù)進(jìn)行分組倦畅，以便計(jì)算表達(dá)值的離散度和估計(jì)表達(dá)倍數(shù)差異，通常格式為~ batch + conditions
(為了方便后續(xù)計(jì)算绣的，最為關(guān)注的分組信息放在最后一位)叠赐。

如果記錄了樣本的批次信息欲账，或者其它需要抹除的信息可以定義在design參數(shù)中，在下游回歸的分析中會(huì)根據(jù)design formula來(lái)估計(jì)batch effect的影響芭概，并在下游分析時(shí)減去這個(gè)影響赛不。這是處理batch effect的推薦方式。

在模型中考慮batch effect并沒(méi)有在數(shù)據(jù)矩陣中移除bacth effect罢洲，如果下游處理時(shí)踢故，確實(shí)有需要可以使用limma包的removeBatchEffect來(lái)處理。

countData: 表達(dá)矩陣

colData: 樣品分組信息表

design: 實(shí)驗(yàn)設(shè)計(jì)信息惹苗，batch和conditions必須是colData中的一列

ddsFullCountTable <- DESeqDataSetFromMatrix(countData = data,
        colData = sample,  design= ~ batch + conditions)

dds <- DESeq(ddsFullCountTable)

rld <- rlog(dds, blind=FALSE)
rlogMat <- assay(rld)
rlogMat <- limma::removeBatchEffect(rlogMat, c(sample$batch))

Just to be clear, there's an important difference between removing a
batch effect and modelling a batch effect. Including the batch in your
design formula will model the batch effect in the regression step, which
means that the raw data are not modified (so the batch effect is not
removed), but instead the regression will estimate the size of the batch
effect and subtract it out when performing all other tests. In addition,
the model's residual degrees of freedom will be reduced appropriately to
reflect the fact that some degrees of freedom were "spent" modelling the
batch effects. This is the preferred approach for any method that is
capable of using it (this includes DESeq2). You would only remove the
batch effect (e.g. using limma's removeBatchEffect function) if you were
going to do some kind of downstream analysis that can't model the batch
effects, such as training a classifier.

批次效應(yīng)模擬

#Make some simulated data with a batch effect:
dds <- makeExampleDESeqDataSet(betaSD=1,interceptMean=10)
dds$batch <- factor(rep(c("A","B"),each=6))

#VST, remove batch effect, then plotPCA:

vsd <- vst(dds)
plotPCA(vsd, "batch")

image

assay(vsd) <- limma::removeBatchEffect(assay(vsd), vsd$batch)
plotPCA(vsd, "batch")

image

SVA(批次未記錄時(shí)殿较，尋找潛在影響因子，并矯正)

dat <- counts(dds, normalized=TRUE)
idx <- rowMeans(dat) > 1
dat <- dat[idx,]
mod <- model.matrix(~ dex, colData(dds))
mod0 <- model.matrix(~ 1, colData(dds))
##calculating the variables
n.sv <- num.sv(dat,mod,method="leek") #gives 11
### using 4
lnj.corr <- svaBatchCor(dat,mod,mod0,n.sv=4)
co <- lnj.corr$corrected
plPCA(co)
#http://www.bioconductor.org/packages/release/bioc/vignettes/sva/inst/doc/sva.pdf
#http://bioconductor.org/help/workflows/rnaseqGene/#removing-hidden-batch-effects

參考資料

https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html
http://dx.doi.org/10.1186/s13059-014-0550-8

最后編輯于：2018.07.26 15:54:12

?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請(qǐng)聯(lián)系作者

人面猴
序言：七十年代末桩蓉，一起剝皮案震驚了整個(gè)濱河市斜脂，隨后出現(xiàn)的幾起案子，更是在濱河造成了極大的恐慌触机，老刑警劉巖帚戳，帶你破解...
沈念sama閱讀 216,372評(píng)論 6贊 498
死咒
序言：濱河連續(xù)發(fā)生了三起死亡事件，死亡現(xiàn)場(chǎng)離奇詭異儡首，居然都是意外死亡片任，警方通過(guò)查閱死者的電腦和手機(jī)，發(fā)現(xiàn)死者居然都...
沈念sama閱讀 92,368評(píng)論 3贊 392
救了他兩次的神仙讓他今天三更去死
文/潘曉璐我一進(jìn)店門(mén)蔬胯，熙熙樓的掌柜王于貴愁眉苦臉地迎上來(lái)对供，“玉大人，你說(shuō)我怎么就攤上這事氛濒〔。” “怎么了？”我有些...
開(kāi)封第一講書(shū)人閱讀 162,415評(píng)論 0贊 353
道士緝兇錄：失蹤的賣姜人
文/不壞的土叔我叫張陵舞竿，是天一觀的道長(zhǎng)京景。經(jīng)常有香客問(wèn)我，道長(zhǎng)骗奖，這世上最難降的妖魔是什么确徙？我笑而不...
開(kāi)封第一講書(shū)人閱讀 58,157評(píng)論 1贊 292
?港島之戀（遺憾婚禮）
正文為了忘掉前任，我火速辦了婚禮执桌，結(jié)果婚禮上鄙皇，老公的妹妹穿的比我還像新娘。我一直安慰自己仰挣，他們只是感情好伴逸，可當(dāng)我...
茶點(diǎn)故事閱讀 67,171評(píng)論 6贊 388
惡毒庶女頂嫁案：這布局不是一般人想出來(lái)的
文/花漫我一把揭開(kāi)白布。她就那樣靜靜地躺著膘壶，像睡著了一般错蝴。火紅的嫁衣襯著肌膚如雪博烂。梳的紋絲不亂的頭發(fā)上，一...
開(kāi)封第一講書(shū)人閱讀 51,125評(píng)論 1贊 297
城市分裂傳說(shuō)
那天漱竖，我揣著相機(jī)與錄音禽篱，去河邊找鬼。笑死馍惹，一個(gè)胖子當(dāng)著我的面吹牛躺率，可吹牛的內(nèi)容都是我干的。我是一名探鬼主播万矾，決...
沈念sama閱讀 40,028評(píng)論 3贊 417
雙鴛鴦連環(huán)套：你想象不到人心有多黑
文/蒼蘭香墨我猛地睜開(kāi)眼悼吱，長(zhǎng)吁一口氣：“原來(lái)是場(chǎng)噩夢(mèng)啊……” “哼！你這毒婦竟也來(lái)了良狈？” 一聲冷哼從身側(cè)響起后添，我...
開(kāi)封第一講書(shū)人閱讀 38,887評(píng)論 0贊 274
萬(wàn)榮殺人案實(shí)錄
序言：老撾萬(wàn)榮一對(duì)情侶失蹤，失蹤者是張志新（化名）和其女友劉穎薪丁，沒(méi)想到半個(gè)月后遇西，有當(dāng)?shù)厝嗽跇?shù)林里發(fā)現(xiàn)了一具尸體，經(jīng)...
沈念sama閱讀 45,310評(píng)論 1贊 310
?護(hù)林員之死
正文獨(dú)居荒郊野嶺守林人離奇死亡严嗜，尸身上長(zhǎng)有42處帶血的膿包…… 初始之章·張勛以下內(nèi)容為張勛視角年9月15日...
茶點(diǎn)故事閱讀 37,533評(píng)論 2贊 332
?白月光啟示錄
正文我和宋清朗相戀三年粱檀，在試婚紗的時(shí)候發(fā)現(xiàn)自己被綠了。大學(xué)時(shí)的朋友給我發(fā)了我未婚夫和他白月光在一起吃飯的照片漫玄。...
茶點(diǎn)故事閱讀 39,690評(píng)論 1贊 348
活死人
序言：一個(gè)原本活蹦亂跳的男人離奇死亡茄蚯，死狀恐怖，靈堂內(nèi)的尸體忽然破棺而出睦优，到底是詐尸還是另有隱情渗常，我是刑警寧澤，帶...
沈念sama閱讀 35,411評(píng)論 5贊 343
?日本核電站爆炸內(nèi)幕
正文年R本政府宣布汗盘，位于F島的核電站皱碘，受9級(jí)特大地震影響，放射性物質(zhì)發(fā)生泄漏衡未。R本人自食惡果不足惜尸执，卻給世界環(huán)境...
茶點(diǎn)故事閱讀 41,004評(píng)論 3贊 325
男人毒藥：我在死后第九天來(lái)索命
文/蒙蒙一、第九天我趴在偏房一處隱蔽的房頂上張望缓醋。院中可真熱鬧，春花似錦绊诲、人聲如沸送粱。這莊子的主人今日做“春日...
開(kāi)封第一講書(shū)人閱讀 31,659評(píng)論 0贊 22
一樁弒父案掂之，背后竟有這般陰謀
文/蒼蘭香墨我抬頭看了看天上的太陽(yáng)抗俄。三九已至脆丁，卻和暖如春，著一層夾襖步出監(jiān)牢的瞬間动雹，已是汗流浹背槽卫。一陣腳步聲響...
開(kāi)封第一講書(shū)人閱讀 32,812評(píng)論 1贊 268
情欲美人皮
我被黑心中介騙來(lái)泰國(guó)打工，沒(méi)想到剛下飛機(jī)就差點(diǎn)兒被人妖公主榨干…… 1. 我叫王不留胰蝠，地道東北人歼培。一個(gè)月前我還...
沈念sama閱讀 47,693評(píng)論 2贊 368
代替公主和親
正文我出身青樓，卻偏偏與公主長(zhǎng)得像茸塞，于是被迫代替她去往敵國(guó)和親躲庄。傳聞我的和親對(duì)象是個(gè)殘疾皇子，可洞房花燭夜當(dāng)晚...
茶點(diǎn)故事閱讀 44,577評(píng)論 2贊 353

DESeq2差異基因分析和批次效應(yīng)移除

差異基因鑒定

基因表達(dá)標(biāo)準(zhǔn)化

DESeq2 差異基因鑒定一步法

DESeq2 差異基因鑒定分步法

安裝包 DESeq2安裝方法如下

加載包

讀入數(shù)據(jù)

產(chǎn)生DESeq數(shù)據(jù)集

獲取標(biāo)準(zhǔn)化后的數(shù)據(jù)

差異基因分析

關(guān)于批次效應(yīng)

參考資料

推薦閱讀更多精彩內(nèi)容

安裝包 `DESeq2`安裝方法如下