差異基因鑒定
基因表達(dá)標(biāo)準(zhǔn)化
不同樣品的測(cè)序量會(huì)有差異,最簡(jiǎn)單的標(biāo)準(zhǔn)化方式是計(jì)算
counts per million (CPM)
狐血,即原始reads count
除以總reads數(shù)乘以1,000,000疏唾。
這種計(jì)算方式的缺點(diǎn)是容易受到極高表達(dá)且在不同樣品中存在差異表達(dá)的基因的影響冠跷;這些基因的打開(kāi)或關(guān)閉會(huì)影響到細(xì)胞中總的分子數(shù)目荡含,可能導(dǎo)致這些基因標(biāo)準(zhǔn)化之后就不存在表達(dá)差異了火窒,而原本沒(méi)有差異的基因標(biāo)準(zhǔn)化之后卻有差異了零酪。
RPKM冒嫡、FPKM和TPM是CPM按照基因或轉(zhuǎn)錄本長(zhǎng)度歸一化后的表達(dá),也會(huì)受到這一影響四苇。
calc_cpm <- function (expr_mat, spikes = NULL){
norm_factor <- colSums(expr_mat[-spikes,])
return(t(t(expr_mat)/norm_factor)) * 10^6
}
為了解決這一問(wèn)題孝凌,研究者提出了其它的標(biāo)準(zhǔn)化方法。
量化因子 (size factor,SF)是由DESeq
提出的月腋。其方法是首先計(jì)算每個(gè)基因在所有樣品中表達(dá)的幾何平均值蟀架。每個(gè)細(xì)胞的量化因子(size
factor)是所有基因與其在所有樣品中的表達(dá)值的幾何平均值的比值的中位數(shù)。由于幾何平均值的使用榆骚,只有在所有樣品中表達(dá)都不為0的基因才能用來(lái)計(jì)算片拍。這一方法又被稱為RLE (relative log expression)。
calc_sf <- function (expr_mat, spikes=NULL){
geomeans <- exp(rowMeans(log(expr_mat[-spikes,])))
SF <- function(cnts){
median((cnts/geomeans)[(is.finite(geomeans) & geomeans >0)])
}
norm_factor <- apply(expr_mat[-spikes,],2,SF)
return(t(t(expr_mat)/norm_factor))
}
上四分位數(shù) (upperquartile,UQ)是樣品中所有基因的表達(dá)除以處于上四分位數(shù)的基因的表達(dá)值妓肢。同時(shí)為了保證表達(dá)水平的相對(duì)穩(wěn)定捌省,計(jì)算得到的上四分位數(shù)值要除以所有樣品中上四分位數(shù)值的中位數(shù)。
calc_uq <- function (expr_mat, spikes=NULL){
UQ <- function(x) {
quantile(x[x>0],0.75)
}
uq <- unlist(apply(expr_mat[-spikes,],2,UQ))
norm_factor <- uq/median(uq)
return(t(t(expr_mat)/norm_factor))
}
TMM是M-值的加權(quán)截尾均值碉钠。選定一個(gè)樣品為參照佑钾,其它樣品中基因的表達(dá)相對(duì)于參照樣品中對(duì)應(yīng)基因表達(dá)倍數(shù)的log2值定義為M-值奄抽。隨后去除M-值中最高和最低的30%除秀,剩下的M值計(jì)算加權(quán)平均值纺蛆。每一個(gè)非參照樣品的基因表達(dá)值都乘以計(jì)算出的TMM。這個(gè)方法假設(shè)大部分基因的表達(dá)是沒(méi)有差異的污筷。
DESeq2 差異基因鑒定一步法
為了簡(jiǎn)化差異基因的運(yùn)算工闺,易生信做了腳本封裝,DESeq2.sh
瓣蛀,只需提供原始基因表達(dá)矩陣陆蟆、樣品分組信息表即可進(jìn)行差異基因分析和鑒定。
DESeq2.sh
OPTIONS:
-f Data file [A gene count matrix, NECESSARY]
CHECK ABOVE FOR DETAILS
-s Sample file [A multiple columns file with header line,
For <timeseries>, one <conditions> columns is needed.
NECESSARY]
CHECK ABOVE FOR DETAILS
-d The design formula for DESeqDataSetFromMatrix.
[Default <conditions>,
accept <cell+time+cell:time> for example 2.]
-D The reduced design formula for DESeq.
[Only applicable to <timeseries> analysis,
accept <cell+time> or <time> or <cell> for example 2.]
-m Specify the comparasion mode.
[Default <pairwise>, accept <timeseries>,
<pairwise> comparasion will still be done in <timeseries>
mode.
NECESSARY]
-p A file containing the pairs needed to do comparasion.
CHECK ABOVE FOR DETAILS
All samples will be compared in <pairwise> mode if not specified here.
-F Log2 Fold change for screening DE genes.
Default 1
-P FDR for screening DE genes.
Default 0.01
-q FDR for screening time-series DE genes.
Default 0.1
-e Execute programs
Default TRUE
-i Install packages of not exist.
Default FALSE
Eg.
DESeq2.sh -f matirx -s sample -d conditions
DESeq2.sh -f matirx -s sample -p compare_pair
具體使用
cd ~/data
# ehbio_trans.Count_matrix.xls和sampleFile都是前面用到的文件
DESeq2.sh -f ehbio_trans.Count_matrix.xls -s sampleFile
[1] "Perform pairwise comparasion using <design=~conditions>"
estimating size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
[1] "Output normalized counts"
[1] "Output rlog transformed normalized counts"
[1] "Performing sample clustering"
null device
1
[1] "DE genes between untrt trt"
[1] "PCA analysis"
null device
運(yùn)行結(jié)束即可獲得以下結(jié)果文件
# 具體的文件內(nèi)容和圖的樣式見(jiàn)后面的分步法文檔
# 原始輸入文件
ehbio_trans.Count_matrix.xls
sampleFile
# 所有差異基因列表
ehbio_trans.Count_matrix.xls.DESeq2.all.DE
# PCA結(jié)果
ehbio_trans.Count_matrix.xls.DESeq2.normalized.rlog.pca.pdf
# 樣品相關(guān)性層級(jí)聚類結(jié)果
ehbio_trans.Count_matrix.xls.DESeq2.normalized.rlog.pearson.pdf
# rlog轉(zhuǎn)換后的標(biāo)準(zhǔn)化后的表達(dá)結(jié)果
ehbio_trans.Count_matrix.xls.DESeq2.normalized.rlog.xls
# 標(biāo)準(zhǔn)化后的表達(dá)結(jié)果
ehbio_trans.Count_matrix.xls.DESeq2.normalized.xls
# 運(yùn)行腳本
ehbio_trans.Count_matrix.xls.DESeq2.r
# 差異基因結(jié)果
ehbio_trans.Count_matrix.xls.DESeq2.untrt._higherThan_.trt.id.xls
ehbio_trans.Count_matrix.xls.DESeq2.untrt._higherThan_.trt.xls
ehbio_trans.Count_matrix.xls.DESeq2.untrt._lowerThan_.trt.id.xls
ehbio_trans.Count_matrix.xls.DESeq2.untrt._lowerThan_.trt.xls
# 火山圖和火山圖輸入數(shù)據(jù)
ehbio_trans.Count_matrix.xls.DESeq2.untrt._vs_.trt.results.xls
ehbio_trans.Count_matrix.xls.DESeq2.untrt._vs_.trt.results.xls.volcano.pdf
一步法腳本可在https://ke.qq.com/course/301387?tuin=20cd7788或點(diǎn)擊閱讀原文獲得揪惦。
DESeq2 差異基因鑒定分步法
安裝包 DESeq2
安裝方法如下
source("https://bioconductor.org/biocLite.R")
biocLite('BiocInstaller')
biocLite("DESeq2")
install.packages(c("gplots", "amap", "ggplot2"))
A distributional assumption is needed because we want to estimate the
probability of extreme events (large fold change just appearing by
chance) from limited replicates. The negative binomial (a.k.a.
Gamma-Poisson) is a good choice for RNA-seq experiments because
- The observed data at gene level is inherently counts or estimated
counts of fragments for each feature and - The spread of values among biological replicates is more than given
by a simpler, one parameter distribution, the Poisson; and it seems
to be captured by the NB sufficiently well
加載包
library(DESeq2)
## Loading required package: S4Vectors
## Loading required package: stats4
## Loading required package: BiocGenerics
## Loading required package: parallel
##
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:parallel':
##
## clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
## clusterExport, clusterMap, parApply, parCapply, parLapply,
## parLapplyLB, parRapply, parSapply, parSapplyLB
## The following objects are masked from 'package:stats':
##
## IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
##
## anyDuplicated, append, as.data.frame, cbind, colMeans,
## colnames, colSums, do.call, duplicated, eval, evalq, Filter,
## Find, get, grep, grepl, intersect, is.unsorted, lapply,
## lengths, Map, mapply, match, mget, order, paste, pmax,
## pmax.int, pmin, pmin.int, Position, rank, rbind, Reduce,
## rowMeans, rownames, rowSums, sapply, setdiff, sort, table,
## tapply, union, unique, unsplit, which, which.max, which.min
##
## Attaching package: 'S4Vectors'
## The following object is masked from 'package:base':
##
## expand.grid
## Loading required package: IRanges
## Loading required package: GenomicRanges
## Loading required package: GenomeInfoDb
## Loading required package: SummarizedExperiment
## Loading required package: Biobase
## Welcome to Bioconductor
##
## Vignettes contain introductory material; view with
## 'browseVignettes()'. To cite Bioconductor, see
## 'citation("Biobase")', and for packages 'citation("pkgname")'.
## Loading required package: DelayedArray
## Loading required package: matrixStats
##
## Attaching package: 'matrixStats'
## The following objects are masked from 'package:Biobase':
##
## anyMissing, rowMedians
##
## Attaching package: 'DelayedArray'
## The following objects are masked from 'package:matrixStats':
##
## colMaxs, colMins, colRanges, rowMaxs, rowMins, rowRanges
## The following object is masked from 'package:base':
##
## apply
library("RColorBrewer")
library("gplots")
##
## Attaching package: 'gplots'
## The following object is masked from 'package:IRanges':
##
## space
## The following object is masked from 'package:S4Vectors':
##
## space
## The following object is masked from 'package:stats':
##
## lowess
library("amap")
library("ggplot2")
library("BiocParallel")
# 多線程加速, 使用4個(gè)核
# 如果你電腦性能強(qiáng)大遍搞,可以把這個(gè)數(shù)變大
register(MulticoreParam(4))
讀入數(shù)據(jù)
# 注意文件的位置,程序自己不知道文件在什么地方
# 如果只給文件名器腋,不給路徑溪猿,程序只會(huì)在當(dāng)前目錄查找文件
# 若找不到钩杰,則會(huì)報(bào)錯(cuò)
# 可以使用下面命令設(shè)置工作目錄到文件所在目錄
# 或提供文件全路徑
# setwd("~/data")
data <- read.table("ehbio_trans.Count_matrix.xls", header=T, row.names=1, com='', quote='',
check.names=F, sep="\t")
# 撇掉在多于兩個(gè)樣本中count<1的值,如果樣本數(shù)多诊县,這個(gè)數(shù)值可以適當(dāng)增加
# 排除極低表達(dá)基因的干擾
data <- data[rowSums(data)>2,]
head(data)
## untrt_N61311 untrt_N052611 untrt_N080611 untrt_N061011
## ENSG00000227232 13 25 23 24
## ENSG00000278267 0 5 3 4
## ENSG00000268903 0 2 0 0
## ENSG00000269981 0 3 0 1
## ENSG00000241860 3 11 1 5
## ENSG00000279457 46 90 73 49
## trt_N61311 trt_N052611 trt_N080611 trt_N061011
## ENSG00000227232 12 12 22 22
## ENSG00000278267 2 4 3 1
## ENSG00000268903 0 2 0 3
## ENSG00000269981 0 2 0 0
## ENSG00000241860 3 2 0 2
## ENSG00000279457 52 46 89 31
# 讀入分組信息
sample <- read.table("sampleFile", header=T, row.names=1, com='',
quote='', check.names=F, sep="\t", colClasses="factor")
sample <- sample[match(colnames(data), rownames(sample)),, drop=F]
sample_rowname <- rownames(sample)
# 下面的可以忽略讲弄,如果沒(méi)遇到錯(cuò)誤不需要執(zhí)行
# 目的是做因子轉(zhuǎn)換
sample <- data.frame(lapply(sample, function(x) factor(x, levels=unique(x))))
rownames(sample) <- sample_rowname
sample
## conditions
## untrt_N61311 untrt
## untrt_N052611 untrt
## untrt_N080611 untrt
## untrt_N061011 untrt
## trt_N61311 trt
## trt_N052611 trt
## trt_N080611 trt
## trt_N061011 trt
產(chǎn)生DESeq數(shù)據(jù)集
DESeq2包采用DESeqDataSet
存儲(chǔ)原始的read count
和中間計(jì)算的統(tǒng)計(jì)量。
每個(gè)DESeqDataSet
對(duì)象都要有一個(gè)實(shí)驗(yàn)設(shè)計(jì)formula依痊,用于對(duì)數(shù)據(jù)進(jìn)行分組避除,以便計(jì)算表達(dá)值的離散度和估計(jì)表達(dá)倍數(shù)差異,通常格式為~ batch + conditions
(為了方便后續(xù)計(jì)算胸嘁,最為關(guān)注的分組信息放在最后一位)瓶摆。
countData
: 表達(dá)矩陣
colData
: 樣品分組信息表
design
: 實(shí)驗(yàn)設(shè)計(jì)信息,conditions
必須是colData
中的一列
ddsFullCountTable <- DESeqDataSetFromMatrix(countData = data,
colData = sample, design= ~ conditions)
dds <- DESeq(ddsFullCountTable)
## estimating size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
獲取標(biāo)準(zhǔn)化后的數(shù)據(jù)
# ?counts查看此函數(shù)功能
# normalized=T, 返回標(biāo)準(zhǔn)化的數(shù)據(jù)
normalized_counts <- counts(dds, normalized=TRUE)
head(normalized_counts)
## untrt_N61311 untrt_N052611 untrt_N080611 untrt_N061011
## ENSG00000227232 12.730963 21.179286 19.4979982 25.994727
## ENSG00000278267 0.000000 4.235857 2.5432172 4.332454
## ENSG00000268903 0.000000 1.694343 0.0000000 0.000000
## ENSG00000269981 0.000000 2.541514 0.0000000 1.083114
## ENSG00000241860 2.937915 9.318886 0.8477391 5.415568
## ENSG00000279457 45.048024 76.245430 61.8849507 53.072567
## trt_N61311 trt_N052611 trt_N080611 trt_N061011
## ENSG00000227232 13.423907 17.885811 15.750664 23.250144
## ENSG00000278267 2.237318 5.961937 2.147818 1.056825
## ENSG00000268903 0.000000 2.980969 0.000000 3.170474
## ENSG00000269981 0.000000 2.980969 0.000000 0.000000
## ENSG00000241860 3.355977 2.980969 0.000000 2.113649
## ENSG00000279457 58.170264 68.562276 63.718596 32.761567
根據(jù)基因在不同的樣本中表達(dá)變化的差異程度mad值
對(duì)數(shù)據(jù)排序性宏,差異越大的基因排位越前群井。
normalized_counts_mad <- apply(normalized_counts, 1, mad)
normalized_counts <- normalized_counts[order(normalized_counts_mad, decreasing=T), ]
# 標(biāo)準(zhǔn)化后的數(shù)據(jù)輸出
write.table(normalized_counts, file="ehbio_trans.Count_matrix.xls.DESeq2.normalized.xls",
quote=F, sep="\t", row.names=T, col.names=T)
# 只在Linux下能運(yùn)行,目的是填補(bǔ)表格左上角內(nèi)容
system(paste("sed -i '1 s/^/ID\t/'", "ehbio_trans.Count_matrix.xls.DESeq2.normalized.xls"))
# log轉(zhuǎn)換后的結(jié)果
rld <- rlog(dds, blind=FALSE)
rlogMat <- assay(rld)
rlogMat <- rlogMat[order(normalized_counts_mad, decreasing=T), ]
write.table(rlogMat, file="ehbio_trans.Count_matrix.xls.DESeq2.normalized.rlog.xls",
quote=F, sep="\t", row.names=T, col.names=T)
# 只在Linux下能運(yùn)行毫胜,目的是填補(bǔ)表格左上角內(nèi)容
system(paste("sed -i '1 s/^/ID\t/'", "ehbio_trans.Count_matrix.xls.DESeq2.normalized.rlog.xls"))
樣品層級(jí)聚類分析书斜,判斷樣品的相似性和組間組內(nèi)差異
# 生成顏色
hmcol <- colorRampPalette(brewer.pal(9, "GnBu"))(100)
# 計(jì)算相關(guān)性pearson correlation
pearson_cor <- as.matrix(cor(rlogMat, method="pearson"))
# 層級(jí)聚類
hc <- hcluster(t(rlogMat), method="pearson")
# 熱圖繪制
## 在命令行下運(yùn)行時(shí),需要去除下面開(kāi)頭的#號(hào)酵使,把輸出的圖保存到文件中
## 輸出到文件時(shí)荐吉,dev.off()命令是關(guān)閉輸出,從而完成圖片的寫(xiě)入口渔。如果不做這一步样屠,則圖片則不能寫(xiě)入,從而不能打開(kāi)
## 如果在Rstudio或其它可視化界面時(shí)搓劫,可以直接把圖輸出到屏幕
#pdf("ehbio_trans.Count_matrix.xls.DESeq2.normalized.rlog.pearson.pdf", pointsize=10)
heatmap.2(pearson_cor, Rowv=as.dendrogram(hc), symm=T, trace="none",
col=hmcol, margins=c(11,11), main="The pearson correlation of each
sample")
#dev.off()
樣品PCA分析
pca_data <- plotPCA(rld, intgroup=c("conditions"), returnData=T, ntop=5000)
差異基因分析
# 定義變量的好處是下面的代碼就獨(dú)立了
# 如果想比較不同的組瞧哟,只需在這修改即可混巧,后面代碼都可以不動(dòng)
sampleA = "untrt"
sampleB = "trt"
results
函數(shù)提取差異基因分析結(jié)果枪向,包含log2 fold changes
,
p values
和adjusted p values
.
constrast
用于指定比較的兩組的信息,輸出的log2FoldChange為log2(SampleA/SampleB)
咧党。
contrastV <- c("conditions", sampleA, sampleB)
res <- results(dds, contrast=contrastV)
res
## log2 fold change (MLE): conditions untrt vs trt
## Wald test p-value: conditions untrt vs trt
## DataFrame with 26280 rows and 6 columns
## baseMean log2FoldChange lfcSE stat pvalue
## <numeric> <numeric> <numeric> <numeric> <numeric>
## ENSG00000227232 18.7141876 0.17695376 0.3809406 0.46451794 0.6422767
## ENSG00000278267 2.8144283 0.02802497 1.0249422 0.02734297 0.9781862
## ENSG00000268903 0.9807232 -1.71669654 2.3186888 -0.74037385 0.4590732
## ENSG00000269981 0.8256996 0.59077960 2.4538911 0.24075217 0.8097472
## ENSG00000241860 3.3713378 1.21773503 1.0789466 1.12863330 0.2590526
## ... ... ... ... ... ...
## 29256 0.3507226 1.8471273 3.484683 0.53007036 0.5960631
## 36308 0.9279831 0.1394489 1.598343 0.08724589 0.9304761
## 8414 0.7459738 -0.2678522 2.053434 -0.13044109 0.8962175
## 28837 1.2626343 0.2689443 1.405958 0.19128899 0.8482992
## 35320 1.0041030 0.5493562 1.717832 0.31979625 0.7491228
## padj
## <numeric>
## ENSG00000227232 0.8538515
## ENSG00000278267 NA
## ENSG00000268903 NA
## ENSG00000269981 NA
## ENSG00000241860 NA
## ... ...
## 29256 NA
## 36308 NA
## 8414 NA
## 28837 NA
## 35320 NA
給DESeq2
的原始輸出結(jié)果增加樣品平均表達(dá)信息秘蛔,使得結(jié)果更容易理解和解析。
# 獲得第一組數(shù)據(jù)均值
baseA <- counts(dds, normalized=TRUE)[, colData(dds)$conditions == sampleA]
if (is.vector(baseA)){
baseMeanA <- as.data.frame(baseA)
} else {
baseMeanA <- as.data.frame(rowMeans(baseA))
}
colnames(baseMeanA) <- sampleA
head(baseMeanA)
## untrt
## ENSG00000227232 19.8507436
## ENSG00000278267 2.7778822
## ENSG00000268903 0.4235857
## ENSG00000269981 0.9061570
## ENSG00000241860 4.6300269
## ENSG00000279457 59.0627429
# 獲得第二組數(shù)據(jù)均值
baseB <- counts(dds, normalized=TRUE)[, colData(dds)$conditions == sampleB]
if (is.vector(baseB)){
baseMeanB <- as.data.frame(baseB)
} else {
baseMeanB <- as.data.frame(rowMeans(baseB))
}
colnames(baseMeanB) <- sampleB
head(baseMeanB)
## trt
## ENSG00000227232 17.5776316
## ENSG00000278267 2.8509744
## ENSG00000268903 1.5378607
## ENSG00000269981 0.7452421
## ENSG00000241860 2.1126487
## ENSG00000279457 55.8031756
# 結(jié)果組合
res <- cbind(baseMeanA, baseMeanB, as.data.frame(res))
head(res)
## untrt trt baseMean log2FoldChange lfcSE
## ENSG00000227232 19.8507436 17.5776316 18.7141876 0.17695376 0.3809406
## ENSG00000278267 2.7778822 2.8509744 2.8144283 0.02802497 1.0249422
## ENSG00000268903 0.4235857 1.5378607 0.9807232 -1.71669654 2.3186888
## ENSG00000269981 0.9061570 0.7452421 0.8256996 0.59077960 2.4538911
## ENSG00000241860 4.6300269 2.1126487 3.3713378 1.21773503 1.0789466
## ENSG00000279457 59.0627429 55.8031756 57.4329592 0.08926886 0.2929300
## stat pvalue padj
## ENSG00000227232 0.46451794 0.6422767 0.8538515
## ENSG00000278267 0.02734297 0.9781862 NA
## ENSG00000268903 -0.74037385 0.4590732 NA
## ENSG00000269981 0.24075217 0.8097472 NA
## ENSG00000241860 1.12863330 0.2590526 NA
## ENSG00000279457 0.30474470 0.7605606 0.9130580
# 增加ID信息
res <- cbind(ID=rownames(res), as.data.frame(res))
res$baseMean <- rowMeans(cbind(baseA, baseB))
# 校正后p-value為NA的復(fù)制為1
res$padj[is.na(res$padj)] <- 1
# 按pvalue排序, 把差異大的基因放前面
res <- res[order(res$pvalue),]
head(res)
## ID untrt trt baseMean
## ENSG00000152583 ENSG00000152583 77.8512 1900.412 989.1314
## ENSG00000148175 ENSG00000148175 7233.0010 19483.552 13358.2766
## ENSG00000179094 ENSG00000179094 151.7491 1380.881 766.3151
## ENSG00000134686 ENSG00000134686 1554.0390 4082.440 2818.2395
## ENSG00000125148 ENSG00000125148 1298.0392 5958.946 3628.4926
## ENSG00000120129 ENSG00000120129 773.9769 5975.522 3374.7494
## log2FoldChange lfcSE stat pvalue
## ENSG00000152583 -4.608311 0.21090819 -21.84984 7.800059e-106
## ENSG00000148175 -1.429585 0.08639329 -16.54741 1.671380e-61
## ENSG00000179094 -3.184674 0.20065896 -15.87108 1.005030e-56
## ENSG00000134686 -1.392749 0.09190323 -15.15451 7.073605e-52
## ENSG00000125148 -2.199191 0.14796048 -14.86337 5.698869e-50
## ENSG00000120129 -2.948122 0.19931242 -14.79146 1.663007e-49
## padj
## ENSG00000152583 1.331002e-101
## ENSG00000148175 1.426021e-57
## ENSG00000179094 5.716612e-53
## ENSG00000134686 3.017600e-48
## ENSG00000125148 1.944910e-46
## ENSG00000120129 4.729591e-46
整體分析結(jié)果輸出到文件
comp314 <- paste(sampleA, "_vs_", sampleB, sep=".")
# 生成文件名
file_base <- paste("ehbio_trans.Count_matrix.xls.DESeq2", comp314, sep=".")
file_base1 <- paste(file_base, "results.xls", sep=".")
write.table(as.data.frame(res), file=file_base1, sep="\t", quote=F, row.names=F)
提取差異表達(dá)基因
# 差異基因篩選傍衡,padj<0.1
res_de <- subset(res, res$padj<0.1, select=c('ID', sampleA,
sampleB, 'log2FoldChange', 'padj'))
# foldchang > 1
res_de_up <- subset(res_de, res_de$log2FoldChange>=1)
file <- paste("ehbio_trans.Count_matrix.xls.DESeq2",sampleA,"_higherThan_",sampleB,
'xls', sep=".")
write.table(as.data.frame(res_de_up), file=file, sep="\t", quote=F, row.names=F)
res_de_dw <- subset(res_de, res_de$log2FoldChange<=(-1)*1)
file <- paste("ehbio_trans.Count_matrix.xls.DESeq2",sampleA, "_lowerThan_", sampleB,
'xls', sep=".")
write.table(as.data.frame(res_de_dw), file=file, sep="\t", quote=F, row.names=F)
# 差異基因ID
res_de_up_id = data.frame(ID=res_de_up$ID,
type=paste(sampleA,"_higherThan_", sampleB, sep="."))
res_de_dw_id = data.frame(ID=res_de_dw$ID,
type=paste(sampleA,"_lowerThan_", sampleB, sep="."))
de_id = rbind(res_de_up_id, res_de_dw_id)
file <- "ehbio_trans.Count_matrix.xls.DESeq2.all.DE"
write.table(as.data.frame(de_id), file=file, sep="\t", quote=F, row.names=F, col.names=F)
繪制火山圖
# 可以把前面生成的results.xls文件提交到www.ehbio.com/ImageGP繪制火山圖
# 或者使用我們的s-plot https://github.com/Tong-Chen/s-plot
logCounts <- log2(res$baseMean+1)
logFC <- res$log2FoldChange
FDR <- res$padj
#png(filename=paste(file_base, "Volcano.png", sep="."))
plot(logFC, -1*log10(FDR), col=ifelse(FDR<=0.01, "red", "black"),
xlab="logFC", ylab="-1*log1o(FDR)", main="Volcano plot", pch=".")
#dev.off()
差異基因熱圖
res_de_up_top20_id <- as.vector(head(res_de_up$ID,20))
res_de_dw_top20_id <- as.vector(head(res_de_dw$ID,20))
red_de_top20 <- c(res_de_up_top20_id, res_de_dw_top20_id)
red_de_top20
## [1] "ENSG00000178695" "ENSG00000196517" "ENSG00000116584"
## [4] "ENSG00000144369" "ENSG00000276600" "ENSG00000108821"
## [7] "ENSG00000162692" "ENSG00000181467" "ENSG00000145777"
## [10] "ENSG00000163491" "ENSG00000183160" "ENSG00000172986"
## [13] "ENSG00000164484" "ENSG00000131389" "ENSG00000134253"
## [16] "ENSG00000124766" "ENSG00000227051" "ENSG00000146006"
## [19] "ENSG00000112837" "ENSG00000146592" "ENSG00000152583"
## [22] "ENSG00000148175" "ENSG00000179094" "ENSG00000134686"
## [25] "ENSG00000125148" "ENSG00000120129" "ENSG00000189221"
## [28] "ENSG00000109906" "ENSG00000101347" "ENSG00000162614"
## [31] "ENSG00000096060" "ENSG00000162616" "ENSG00000166741"
## [34] "ENSG00000183044" "ENSG00000164125" "ENSG00000198624"
## [37] "ENSG00000256235" "ENSG00000077684" "ENSG00000135821"
## [40] "ENSG00000164647"
# 可以把矩陣存儲(chǔ)深员,提交到www.ehbio.com/ImageGP繪制火山圖
# 或者使用我們的s-plot https://github.com/Tong-Chen/s-plot
red_de_top20_expr <- normalized_counts[rownames(normalized_counts) %in% red_de_top20,]
library(pheatmap)
pheatmap(red_de_top20_expr, cluster_row=T, scale="row", annotation_col=sample)
關(guān)于批次效應(yīng)
見(jiàn)論壇討論 http://www.ehbio.com/Esx/d/10-deseq2。
每個(gè)DESeqDataSet
對(duì)象都要有一個(gè)實(shí)驗(yàn)設(shè)計(jì)formula蛙埂,用于對(duì)數(shù)據(jù)進(jìn)行分組倦畅,以便計(jì)算表達(dá)值的離散度和估計(jì)表達(dá)倍數(shù)差異,通常格式為~ batch + conditions
(為了方便后續(xù)計(jì)算绣的,最為關(guān)注的分組信息放在最后一位)叠赐。
如果記錄了樣本的批次信息欲账,或者其它需要抹除的信息可以定義在design
參數(shù)中,在下游回歸的分析中會(huì)根據(jù)design formula
來(lái)估計(jì)batch effect
的影響芭概,并在下游分析時(shí)減去這個(gè)影響赛不。這是處理batch effect
的推薦方式。
在模型中考慮batch effect
并沒(méi)有在數(shù)據(jù)矩陣中移除bacth effect
罢洲,如果下游處理時(shí)踢故,確實(shí)有需要可以使用limma
包的removeBatchEffect
來(lái)處理。
countData
: 表達(dá)矩陣
colData
: 樣品分組信息表
design
: 實(shí)驗(yàn)設(shè)計(jì)信息惹苗,batch
和conditions
必須是colData
中的一列
ddsFullCountTable <- DESeqDataSetFromMatrix(countData = data,
colData = sample, design= ~ batch + conditions)
dds <- DESeq(ddsFullCountTable)
rld <- rlog(dds, blind=FALSE)
rlogMat <- assay(rld)
rlogMat <- limma::removeBatchEffect(rlogMat, c(sample$batch))
Just to be clear, there's an important difference between removing a
batch effect and modelling a batch effect. Including the batch in your
design formula will model the batch effect in the regression step, which
means that the raw data are not modified (so the batch effect is not
removed), but instead the regression will estimate the size of the batch
effect and subtract it out when performing all other tests. In addition,
the model's residual degrees of freedom will be reduced appropriately to
reflect the fact that some degrees of freedom were "spent" modelling the
batch effects. This is the preferred approach for any method that is
capable of using it (this includes DESeq2). You would only remove the
batch effect (e.g. using limma's removeBatchEffect function) if you were
going to do some kind of downstream analysis that can't model the batch
effects, such as training a classifier.
批次效應(yīng)模擬
#Make some simulated data with a batch effect:
dds <- makeExampleDESeqDataSet(betaSD=1,interceptMean=10)
dds$batch <- factor(rep(c("A","B"),each=6))
#VST, remove batch effect, then plotPCA:
vsd <- vst(dds)
plotPCA(vsd, "batch")
assay(vsd) <- limma::removeBatchEffect(assay(vsd), vsd$batch)
plotPCA(vsd, "batch")
SVA(批次未記錄時(shí)殿较,尋找潛在影響因子,并矯正)
dat <- counts(dds, normalized=TRUE)
idx <- rowMeans(dat) > 1
dat <- dat[idx,]
mod <- model.matrix(~ dex, colData(dds))
mod0 <- model.matrix(~ 1, colData(dds))
##calculating the variables
n.sv <- num.sv(dat,mod,method="leek") #gives 11
### using 4
lnj.corr <- svaBatchCor(dat,mod,mod0,n.sv=4)
co <- lnj.corr$corrected
plPCA(co)
#http://www.bioconductor.org/packages/release/bioc/vignettes/sva/inst/doc/sva.pdf
#http://bioconductor.org/help/workflows/rnaseqGene/#removing-hidden-batch-effects