? ? ? ? 上一篇已經(jīng)系統(tǒng)介紹了有參RNASeq上游分析赏酥,從測(cè)序數(shù)據(jù)fastq文件到最終生成表達(dá)矩陣蛉鹿。這一系列基本都是RNASeq通用常規(guī)分析斋泄,其下游差異表達(dá)及可視化則根據(jù)需求及研究目標(biāo)而有所不同豁陆。這一篇著重介紹差異表達(dá)分析及常用可視化作圖帅刀,以下進(jìn)入正題:
六菲饼、差異表達(dá)分析
? ? ? ? 差異表達(dá)分析簡(jiǎn)單來(lái)說(shuō)就是鑒定一個(gè)基因的表達(dá)肾砂,在所選兩個(gè)樣本之間有無(wú)明顯差異,所用到的是統(tǒng)計(jì)學(xué)中的假設(shè)檢驗(yàn)宏悦。差異表達(dá)分析中常用的軟件為DESeq2和edgeR镐确,其均由R語(yǔ)言所寫包吝,這兩個(gè)軟件在發(fā)表的轉(zhuǎn)錄組文章中出現(xiàn)的頻率較高,在這里我們使用R中DESeq2包來(lái)進(jìn)行差異表達(dá)分析源葫,用到的輸入文件為上一篇生成的表達(dá)矩陣(gene_count.csv)文件诗越。差異表達(dá)分析可以使用Linux上的R,也可以使用RStudio來(lái)進(jìn)行分析息堂。RStudio有Windows版本嚷狞,繪圖可以實(shí)時(shí)顯示,方便調(diào)整參數(shù)荣堰。所以感耙,本文基于Windows環(huán)境下的RStudio來(lái)做差異表達(dá)分析:
第一步:準(zhǔn)備表達(dá)矩陣結(jié)果文件(gene_count.csv)
? ? ? ? 直接將Linux服務(wù)器上運(yùn)行的結(jié)果下載到本地即可,通過(guò)Xftp傳輸將gene_count.csv下載到桌面DiffAnalysis文件夾下持隧;
第二步:打開(kāi)RStudio即硼,安裝及加載需要的R包
? ? ? ? 首先安裝需要的R包(DESeq2),R包的安裝在上一篇有所提及屡拨。這里著重講一下只酥,包是R函數(shù)、數(shù)據(jù)呀狼、預(yù)編譯代碼以一種定義完善的格式組成的集合裂允,而存儲(chǔ)包的目錄則稱為庫(kù)(library)。函數(shù)".libPaths()"能夠顯示庫(kù)所在的位置哥艇,函數(shù)"library()"則可以顯示庫(kù)中有哪些包绝编。其中R中自帶了一系列默認(rèn)包(包括base、datasets貌踏、utils十饥、grDevices、graphics祖乳、stats以及methods)逗堵,它們提供了種類繁多的默認(rèn)函數(shù)和數(shù)據(jù)集。而其他包則需要通過(guò)下載來(lái)進(jìn)行安裝眷昆,安裝好后必須被載入才能使用蜒秤。第一次安裝一個(gè)包,需要選擇CRAN鏡像站點(diǎn)亚斋,選擇其中一個(gè)鏡像站點(diǎn)后作媚,使用命令install.packages()即可進(jìn)行下載和安裝,例如:install.packages(“gclus”)帅刊,安裝名為gclus的R包纸泡。一個(gè)包僅需安裝一次,但R包經(jīng)常會(huì)被作者更新厚掷,使用update.packages()即可更新已經(jīng)安裝的包弟灼。
? ? ? ? 但是install.packages()的R包安裝方式只限于發(fā)布在CRAN鏡像站點(diǎn)上的R包安裝,而我們常用的生物統(tǒng)計(jì)或生物信息分析類的包均是發(fā)布在Bioconductor上面冒黑,如我們即將用到的DESeq2包田绑,其Bioconductor上R包的安裝方式與CRAN鏡像站點(diǎn)上不同,當(dāng)然這里面又分兩種情況抡爹,取決于你的R軟件版本掩驱,如果R版本較新(R > 3.5.0),使用BiocManager::install("DESeq2")命令安裝冬竟,如下圖所示:
? ? ? ? 如果R版本較低(R < 3.5.0)欧穴,使用biocLite("DESeq2")命令安裝,如下圖所示:
? ? ? ? 這里建議使用較新版本的R泵殴,因?yàn)槟壳昂芏嘈碌纳镄畔⒎治龅腞包都是基于R3.6或更高版本寫的涮帘,R版本較低的話會(huì)用不了。本文當(dāng)前的R環(huán)境是R3.6.3笑诅,首先安裝DESeq2包调缨,其中DESeq2依賴的R包較多,包括(S4Vectors吆你、stats4弦叶、BiocGenerics、parallel妇多、IRanges伤哺、GenomicRanges、GenomeInfoDb者祖、Biobase立莉、DelayedArray、matrixStats以及BiocParallel)七问,上述包需要全部安裝并加載桃序。加載DESeq2包:library(DESeq2)
第三步:載入文件并矩陣化
? ? ? ? 首先設(shè)置工作目錄,引號(hào)里為桌面DiffAnalysis文件夾的路徑烂瘫。然后讀取“gene_count.csv”文件并進(jìn)行矩陣化媒熊;
setwd("C:/Users/Desktop/DiffAnalysis/")
counts <- read.csv("gene_count.csv", check.names = F, sep = "\t", row.names = 1, header = T)
Count <- as.matrix(counts)
第四步:Use DESeq2 to identify DE genes
Count_condition <- factor(c(rep("Sample1-", 3), rep("Sample2-", 3)))
coldata <- data.frame(row.names=colnames(Count), Count_condition)
dds <- DESeqDataSetFromMatrix(countData=Count, colData=coldata, design=~Count_condition)
dds <- DESeq(dds)
res <- results(dds)
### Output DE result
resOrdered <- res[order(res$padj), ]
resdata <- merge(as.data.frame(resOrdered), as.data.frame(counts(dds, normalized=TRUE)), by="row.names", sort=FALSE)
names(resdata)[1] <- "Gene"
# 篩選過(guò)濾
resdata_filter <- subset(resdata,resdata$padj<0.05 & abs(resdata$log2FoldChange) >1)
write.table(resdata_filter, file="diffexpr_Sample1_Sample2_0.05.txt",sep="\t",row.names=F)
write.table(resdata,?file=“diffexpr_Sample1_Sample2_results.all.txt"),quote=F,sep="\t",row.names=F)?
最終結(jié)果存放于diffexpr_Sample1_Sample2_results.all.txt文件中。
七坟比、可視化
1芦鳍、輸出MAplot
準(zhǔn)備sample.info.txt及comp.info.txt文件:
colData <- read.table("sample.info.txt", header = T)
Pairindex <- read.table("comp.info.txt", sep = "\t", header = T)
for(i in 1:nrow(Pairindex)){
? ?j<-as.vector(unlist(Pairindex[i,][1]))
? ?k<-as.vector(unlist(Pairindex[i,][2]))
? ?print(paste(j,k,sep=" vs "))
? ?sample_1 = j
? ?sample_2 = k
?}
pdf(paste0(sample_1,"_vs_",sample_2,"_MAplot.pdf"))
plotMA(res, main=paste0(sample_1,"_vs_",sample_2,("_MAplot"), ylim=c(-10,10)))
dev.off()
輸出結(jié)果文件“Leaves_vs_Fruits_MAplot.pdf”
2、輸出pheatmap(查看樣本相關(guān)系數(shù))
需安裝pheatmap包葛账,BiocManager::install("pheatmap")柠衅;
ddCor <- cor(Count, method = "spearman")
pheatmap(file = "Leaves_VS_Flower_cor.pdf", ddCor, clustering_method = "average", display_numbers = T)
輸出結(jié)果:Leaves_VS_Flower_cor.pdf
3、PCA分析
vsd <- varianceStabilizingTransformation(dds)
library(ggplot2)
data <- plotPCA(vsd, intgroup=c("sample"), returnData=TRUE)
percentVar <- round(100 * attr(data, "percentVar"))
ggplot(data, aes(PC1, PC2, color = sample)) +? geom_point(size=3) + xlab(paste0("PC1: ",percentVar[1],"% variance")) + ylab(paste0("PC2: ",percentVar[2],"% variance")) + labs(title = "Sample Vs counts PCA") + theme_bw()
ggsave("sample-PCA.pdf")
4籍琳、繪制火山圖
g <- read.table("diffexpr_Sample1_Sample2_results.all.txt", header = T, row.names = 1)
g <- na.omit(g)
plot(g$log2FoldChange, g$padj)
g <- transform(g, padj = -1*log10(g$padj))
down <- g[g$log2FoldChange <= -1,]
up <- g[g$log2FoldChange >=1,]
no <- g[g$log2FoldChange > -1 & g$log2FoldChange < 1,]
plot(no$log2FoldChange, no$padj, xlim = c(-10,10), ylim = c(0,100), col = "blue", pch = 16, cex = 0.8, main = "Gene Expression", xlab = "log2FoldChange", ylab = "-log10(Qvalue)")
points(up$log2FoldChange, up$padj, col = "red", pch = 16, cex = 0.8)
points(down$log2FoldChange, down$padj, col = "green", pch = 16, cex = 0.8)
5狈醉、繪制韋恩圖
? ? ? ? ?這里繪制兩個(gè)樣本的韋恩圖,需載入VennDiagram包势誊,準(zhǔn)備差異表達(dá)基因的ID列表,gene_list_groupA.txt以及gene_list_groupB.txt谣蠢。如需繪制多個(gè)樣本的韋恩圖粟耻,則需準(zhǔn)備多個(gè)樣本的差異表達(dá)基因的ID列表。
library(VennDiagram)
listA <- read.table("gene_list_groupA.txt", header = FALSE)
listB <- read.table("gene_list_groupB.txt", header = FALSE)
A <- listA$V1
B <- listB$V1
length(A); length(B)
venn.diagram(list(A = A, B = B), fill = c("yellow", "cyan"), cex = 1.5, filename = "Sample1_VS_Sample2.png")
6眉踱、COG功能注釋圖
準(zhǔn)備COG功能注釋文件cog.class.annot.txt文件
layout(matrix(c(1,2), nrow = 1), widths = c(20,13))
layout.show(2)
par(mar = c(3,4,4,1) + 0.1)
# 設(shè)定順序
class <- c("J","A","K","L","B","D","Y","V","T","M","N","Z","W","U","O","C","G","E","F","H","I","P","Q","R","S")
# 將m中Code列轉(zhuǎn)化為因子
t <- factor(as.character(m$Code), levels = class)
# 用新的因子數(shù)據(jù)對(duì)原始數(shù)據(jù)進(jìn)行排序
m <- m[order(t),]
head(m)
barplot(m$Gene.Number, space = F, col = rainbow(25), ylab = "Number of Genes", names.arg = m$Code)
# 添加分組信息
l <- c(0, 5, 15, 23, 25)
id <- c("INFORMATION STORAGE\nAND PROCESSING", "CELLULAR PROCESSES\nAND SIGNALING", "METABOLISM", "POORLY\n CHAPACTERIZED")
id
abline(v = l[c(-1,-5)])
for (i in 2:length(l)) {
? ? ? ? text((l[i-1] + l[i])/2, max(m[,3])*1.1, id[i-1], cex = 0.8, xpd = T)
}
par(mar = c(2,0,2,1) + 0.1)
# axes 不顯示坐標(biāo)軸
plot(0,0,type = "n", xlim = c(0,1), ylim = c(0,26), bty = "n", axes = F, xlab = "", ylab = "")
for ( i in 1:length(class)){
? ? ? ? text(0,26-i+0.5, paste(m$Code[i], m$Functional.Categories[i]), pos = 4, cex = 1, pty = T)
}
# 添加標(biāo)題
title(main = "COG Function Classification")
7挤忙、GO功能注釋條形圖
準(zhǔn)備go注釋文件go.csv
go <- read.csv("Rdata/go.csv", header = T)
head(go)
library(ggplot2)
# 排序
go_sort <- go[order(go$Ontology, -go$Percentage),]
m <- go_sort[go_sort$Ontology == "Molecular function",][1:10,]
c <- go_sort[go_sort$Ontology == "Cellular component",][1:10,]
b <- go_sort[go_sort$Ontology == "Biological process",][1:10,]
# 合并向量
slimgo <- rbind(b, c, m)
# 將Term列轉(zhuǎn)化為因子
slimgo$Term = factor(slimgo$Term, levels = slimgo$Term)
colnames(slimgo)
PP <- ggplot(data = slimgo, mapping = aes(x = Term, y = Percentage, fill = Ontology))
PP + geom_bar(stat = "identity")
# 翻轉(zhuǎn)坐標(biāo)軸
PP + geom_bar(stat = "identity") + coord_flip()
PP + geom_bar(stat = "identity") + coord_flip() + scale_x_discrete(limits = rev(levels(slimgo$Term)))
# 去掉圖例
PP + geom_bar(stat = "identity") + coord_flip() + scale_x_discrete(limits = rev(levels(slimgo$Term))) + guides(fill = FALSE)
# 切換黑白背景主題
PP + geom_bar(stat = "identity") + coord_flip() + scale_x_discrete(limits = rev(levels(slimgo$Term))) + guides(fill = FALSE) + theme_bw()
8、KEGG功能注釋圖
準(zhǔn)備kegg.csv文件
library(ggplot2)
pathway <- read.csv("Rdata/kegg.csv", header = T)
head(pathway)
# 查看列名
colnames(pathway)
# 映射數(shù)據(jù)
PP <- ggplot(data = pathway, mapping = aes(x = richFactor, y = Pathway))
# 繪制點(diǎn)圖
PP + geom_point()
# 將點(diǎn)的大小映射到AvsB列
PP + geom_point(mapping = aes(size = AvsB))
# Qvalue映射到color
PP + geom_point(mapping = aes(size = AvsB, color = Qvalue))
# 修改顏色
PP + geom_point(mapping = aes(size = AvsB, color = Qvalue)) +? scale_color_gradient(low = "green", high = "red")
# 修改標(biāo)題
PP + geom_point(mapping = aes(size = AvsB, color = Qvalue)) + scale_color_gradient(low = "green", high = "red") + labs(title = "Top20 of Pathway Enrichment", x = "Rich Factor", y = "Pathway Name", color = "-log10(Qvalue)", size = "Gene Number") + theme_bw()
關(guān)于RNASeq谈喳,OK就這樣了册烈。
以上
參考文獻(xiàn):
https://github.com/mikelove/DESeq2
https://bioconductor.org/packages/release/bioc/manuals/DESeq2/man/DESeq2.pdf