WGCNA分析基于兩個假設(shè):1.相似表達模式的基因可能存在共調(diào)控箍土、功能相關(guān)或處于同一通路鸽捻,2.基因網(wǎng)絡(luò)符合無尺度分布说榆∮媪ィ基于這兩點,可以將基因網(wǎng)絡(luò)根據(jù)表達相似性劃分為不同的模塊進而找出樞紐基因纵顾。
基本分析流程:
1. 首先看下樣本數(shù)據(jù)伍茄,共35個樣本,12328個基因施逾,行是樣本敷矫,列是基因:
rm(list=ls())
library(WGCNA)
library(data.table)
library(stringr)
library(openxlsx)
allowWGCNAThreads()
ALLOW_WGCNA_THREADS=4
memory.limit(size = 20000)
# 查看部分樣本數(shù)據(jù)和性狀數(shù)據(jù):
multiExpr[[1]]$data[1:5,1:5]
datTraits[1:5,1:5]
然后是樣本性狀數(shù)據(jù),行是樣本汉额,列是樣本性狀:
2. 查看樣本數(shù)據(jù)是否完整:
# 檢查數(shù)據(jù)是否正確:
exprSize = checkSets(multiExpr)
# 統(tǒng)計樣本數(shù)量和基因數(shù)量:
nGenes = exprSize$nGenes;
nSamples = exprSize$nSamples;
# 檢查所有基因和樣本的缺失值是否足夠低曹仗。
gsg = goodSamplesGenesMS(multiExpr, verbose = 3);
gsg$allOK
結(jié)果是OK,不OK得對樣本數(shù)據(jù)進行校正蠕搜。
3.選擇一個β值建立臨近矩陣根據(jù)連接度使我們的基因分布符合無尺度網(wǎng)絡(luò):
# 設(shè)定軟閾值范圍
powers = c(c(1:10), seq(from = 12, to=20, by=2))
# 獲得各個閾值下的 R方 和平均連接度
sft = pickSoftThreshold(multiExpr[[1]]$data, powerVector = powers, verbose = 5)
# 作圖:
sizeGrWindow(9, 5)
par(mfrow = c(1,2));
cex1 = 0.9;
# Scale-free topology fit index as a function of the soft-thresholding power
plot(sft$fitIndices[,1], -sign(sft$fitIndices[,3])*sft$fitIndices[,2],
xlab="Soft Threshold (power)",ylab="Scale Free Topology Model Fit,signed R^2",type="n",
main = paste("Scale independence"));
text(sft$fitIndices[,1], -sign(sft$fitIndices[,3])*sft$fitIndices[,2],
labels=powers,cex=cex1,col="red");
# this line corresponds to using an R^2 cut-off of h
abline(h=0.90,col="red")
# Mean connectivity as a function of the soft-thresholding power
plot(sft$fitIndices[,1], sft$fitIndices[,5],
xlab="Soft Threshold (power)",ylab="Mean Connectivity", type="n",
main = paste("Mean connectivity"))
text(sft$fitIndices[,1], sft$fitIndices[,5], labels=powers, cex=cex1,col="red")
左圖:Soft Threshold (power)表示權(quán)重怎茫,縱坐標(biāo)表示連接度k與p(k)的相關(guān)性。右圖:Soft Threshold (power)表示權(quán)重,縱坐標(biāo)表示平均連接度轨蛤。一般要求k與p(k)的相關(guān)性達到0.85時的power作為β值蜜宪,可以看出本次示例中 β=4
4.根據(jù)β值獲得臨近矩陣和拓?fù)渚仃嚕?/h5>
# 獲得臨近矩陣:
softPower <- sft$powerEstimate
adjacency = adjacency(multiExpr[[1]]$data, power = softPower);
# 將臨近矩陣轉(zhuǎn)為 Tom 矩陣
TOM = TOMsimilarity(adjacency);
# 計算基因之間的相異度
dissTOM = 1-TOM
hierTOM = hclust(as.dist(dissTOM),method="average");
5.檢驗選定的β值下記憶網(wǎng)絡(luò)是否逼近 scale free
# ADJ1_cor <- abs(WGCNA::cor( multiExpr[[1]]$data,use = "p" ))^softPower
# 基因少(<5000)的時候使用下面的代碼:
k <- as.vector(apply(ADJ1_cor,2,sum,na.rm=T))
# 基因多的時候使用下面的代碼:
k <- softConnectivity(datE=multiExpr[[1]]$data,power=softPower)
sizeGrWindow(10, 5)
par(mfrow=c(1,2))
hist(k)
scaleFreePlot(k,main="Check Scale free topology\n")
# 獲得臨近矩陣:
softPower <- sft$powerEstimate
adjacency = adjacency(multiExpr[[1]]$data, power = softPower);
# 將臨近矩陣轉(zhuǎn)為 Tom 矩陣
TOM = TOMsimilarity(adjacency);
# 計算基因之間的相異度
dissTOM = 1-TOM
hierTOM = hclust(as.dist(dissTOM),method="average");
# ADJ1_cor <- abs(WGCNA::cor( multiExpr[[1]]$data,use = "p" ))^softPower
# 基因少(<5000)的時候使用下面的代碼:
k <- as.vector(apply(ADJ1_cor,2,sum,na.rm=T))
# 基因多的時候使用下面的代碼:
k <- softConnectivity(datE=multiExpr[[1]]$data,power=softPower)
sizeGrWindow(10, 5)
par(mfrow=c(1,2))
hist(k)
scaleFreePlot(k,main="Check Scale free topology\n")
可以看出k與p(k)成負(fù)相關(guān)(相關(guān)性系數(shù)0.9),說明選擇的β值能夠建立基因無尺度網(wǎng)絡(luò)。
6.對剛才得到的拓?fù)渚仃囀褂孟喈惗萪issimilarity between genes對基因進行聚類祥山,然后使用動態(tài)剪切法對樹進行剪切成不同的模塊(模塊最小基因數(shù)為30):
# 使用相異度來聚類為gene tree(聚類樹):
geneTree = hclust(as.dist(dissTOM), method = "average");
# Plot the resulting clustering tree (dendrogram)
windows()
sizeGrWindow(12,9)
plot(geneTree, xlab="", sub="", main = "Gene clustering on TOM-based dissimilarity",
labels = FALSE, hang = 0.04);
# 使用動態(tài)剪切樹挖掘模塊:
minModuleSize = 30;
# 動態(tài)切割樹:
dynamicMods = cutreeDynamic(dendro = geneTree, distM = dissTOM,
deepSplit = 2, pamRespectsDendro = FALSE,
minClusterSize = minModuleSize);
table(dynamicMods)
共15個modules
7.隨機選擇400個基因畫拓?fù)渲丿B熱圖圃验,圖中行和列都表示單個基因,深黃色和紅色表示高度的拓?fù)渲丿B缝呕。
# 拓?fù)錈釄D:
nSelect = 400
# For reproducibility, we set the random seed
set.seed(10);
select = sample(nGenes, size = nSelect);
selectTOM = dissTOM[select, select];
# There's no simple way of restricting a clustering tree to a subset of genes, so we must re-cluster.
selectTree = hclust(as.dist(selectTOM), method = "average")
selectColors = dynamicColors[select];
# Open a graphical window
sizeGrWindow(9,9)
# Taking the dissimilarity to a power, say 10, makes the plot more informative by effectively changing
# the color palette; setting the diagonal to NA also improves the clarity of the plot
plotDiss = selectTOM^softPower;
diag(plotDiss) = NA;
TOMplot(plotDiss,
selectTree,
selectColors,
main = "Network heatmap plot, selected genes")
8. 計算每個模塊的特征向量基因澳窑,為某一特定模塊第一主成分基因E。代表了該模塊內(nèi)基因表達的整體水平
MEList = moduleEigengenes(multiExpr[[1]]$data, colors = dynamicColors)
MEs = MEList$eigengenes
# 計算根據(jù)模塊特征向量基因計算模塊相異度:
MEDiss = 1-cor(MEs);
# Cluster module eigengenes
METree = hclust(as.dist(MEDiss), method = "average");
# Plot the result
特征向量基因臨近熱圖:
plotEigengeneNetworks(MEs,
"Eigengene adjacency heatmap",
marHeatmap = c(3,4,2,2),
plotDendrograms = FALSE,
xLabelsAngle = 90)
畫出指定模塊表達量熱圖與特征向量基因柱狀圖岳颇,可以看出特征向量基因ME的表達與整個模塊內(nèi)基因的表達高度相關(guān):
# 畫出指定模塊表達量的熱圖:
which.module=Freq_MS_max$GS_color;
ME=mergedMEs[, paste("ME",which.module, sep="")]
par(mfrow=c(2,1), mar=c(0,4.1,4,2.05))
plotMat(t(scale(multiExpr[[1]]$data[,colorh1==which.module ]) ),
nrgcols=30,rlabels=F,rcols=which.module,
main=which.module, cex.main=2)
par(mar=c(2,2.3,0.5,0.8))
barplot(ME, col=which.module, main="", cex.main=2,
ylab="eigengene expression",xlab="array sample")
特征向量基因聚類樹狀圖照捡,紅線以下的模塊表示相關(guān)性>0.8,將被合并话侧。
plot(METree,
main = "Clustering of module eigengenes",
xlab = "",
sub = "")
# 在聚類圖中畫出剪切線
abline(h=MEDissThres, col = "red")
特征向量基因關(guān)聯(lián)圖:
9. 將相關(guān)性系數(shù)大于0.8的模塊合并掉,即相異性系數(shù)小于0.2:(本次合并掉2個模塊)
MEDissThres = 0.2
# 在聚類圖中畫出剪切線
abline(h=MEDissThres, col = "red")
# 合并模塊:
merge_modules = mergeCloseModules(multiExpr[[1]]$data, dynamicColors, cutHeight = MEDissThres, verbose = 3)
# 合并后的顏色:
mergedColors = merge_modules$colors;
# 新模塊的特征向量基因:
mergedMEs = merge_modules$newMEs;
plotDendroAndColors(geneTree, cbind(dynamicColors, mergedColors),
c("Dynamic Tree Cut", "Merged dynamic"),
dendroLabels = FALSE, hang = 0.03,
addGuide = TRUE, guideHang = 0.05)
10. 樣本聚類圖與樣本性狀熱圖:
# 畫出樣本聚類圖(上)與樣本性狀熱圖(下):
traitColors = numbers2colors(datTraits, signed = TRUE,centered=TRUE);
plotDendroAndColors(sampleTrees[[set]],
traitColors,
groupLabels = names(datTraits),
rowTextAlignment = "right-justified",
addTextGuide = TRUE ,
hang = 0.03,
dendroLabels = NULL, # 是否顯示樹labels
addGuide = FALSE, # 顯示虛線
guideHang = 0.05,
main = "Sample dendrogram and trait heatmap")
11. 模塊與樣本性狀相關(guān)性熱圖闯参,行表示模塊瞻鹏,列表示性狀。方塊里的值表示相關(guān)性和pvalue.
moduleTraitCor_noFP <- cor(mergedMEs, datTraits[,1:14], use = "p");
moduleTraitPvalue_noFP = corPvalueStudent(moduleTraitCor_noFP, nSamples);
textMatrix_noFP <- paste(signif(moduleTraitCor_noFP, 2), "\n(", signif(moduleTraitPvalue_noFP, 1), ")", sep = "");
par(mar = c(10, 8.5, 3, 3));
labeledHeatmap(Matrix = moduleTraitCor_noFP,
xLabels = names(datTraits[,1:14]),
yLabels = names(mergedMEs),
ySymbols = names(mergedMEs),
colorLabels = FALSE,
colors = blueWhiteRed(50),
textMatrix = textMatrix_noFP,
setStdMargins = FALSE,
cex.text = 0.65,
zlim = c(-1,1),
main = paste("Module-trait relationships"))
12. 根據(jù)性狀與模塊特征向量基因的相關(guān)性及pvalue來挖掘與性狀相關(guān)的模塊
cor_ADR <- signif(WGCNA::cor(datTraits,mergedMEs,use="p",method="pearson"),5)
p.values <- corPvalueStudent(cor_ADR,nSamples=nrow(datTraits))
選擇除去grey
模塊外相關(guān)系數(shù)最高pvalue最小的那個模塊(grey表示未劃分到任一模塊的基因
):
Freq_MS_max_cor <- which.max(abs(cor_ADR["Freq",-which(colnames(cor_ADR) == "MEgrey")]))
Freq_MS_max_p <- which.min(p.values["Freq",-which(colnames(p.values) == "MEgrey")])
可以看出結(jié)果都是第9個模塊, greenyellow:
13. 根據(jù)基因網(wǎng)絡(luò)顯著性鹿寨,也就是性狀與每個基因表達量相關(guān)性在各個模塊的均值作為該性狀在該模塊的顯著性新博,顯著性最大的那個模塊與該性狀最相關(guān):
GS1 <- as.numeric(WGCNA::cor(datTraits[,i],multiExpr[[1]]$data,use="p",method="pearson"))
# 顯著性是絕對值:
GeneSignificance <- abs(GS1)
# 獲得該性狀在每個模塊中的顯著性:
ModuleSignificance <- tapply(GeneSignificance,mergedColors,mean,na.rm=T)
可以看出,上面兩種方法得到的結(jié)果相同脚草,與性狀最相關(guān)的模塊是
greenyellow
.
14. 尋找與該性狀相關(guān)的樞紐基因(hub genes),首先計算基因的內(nèi)部連接度和模塊身份赫悄,內(nèi)部連接度衡量的是基因在模塊內(nèi)部的地位,而模塊身份表明基因?qū)儆谀膫€模塊馏慨。
# 計算每個基因模塊內(nèi)部連接度埂淮,也就是基因直接兩兩加權(quán)相關(guān)性。
ADJ1=abs(cor(multiExpr[[1]]$data,use="p"))^softPower
# 根據(jù)上面結(jié)果和基因所屬模塊信息獲得連接度:
# 整體連接度 kTotal写隶,模塊內(nèi)部連接度:kWithin倔撞,kOut=kTotal-kWithin, kDiff=kIn-kOut=2*kIN-kTotal
Alldegrees1=intramodularConnectivity(ADJ1, colorh1)
# 注意模塊內(nèi)基于特征向量基因連接度評估模塊內(nèi)其他基因: de ne a module eigengene-based connectivity measure for each gene as the correlation between a the gene expression and the module eigengene
# 如 brown 模塊內(nèi):kM Ebrown(i) = cor(xi, MEbrown) 慕趴, xi is the gene expression pro le of gene i and M Ebrown is the module eigengene of the brown module
# 而 module membership 與內(nèi)部連接度不同痪蝇。MM 衡量了基因在全局網(wǎng)絡(luò)中的位置。
datKME=signedKME(multiExpr[[1]]$data, datME, outputColumnName="MM.")
查看內(nèi)部連接度和 MM直接的關(guān)系冕房,以brown為例
which.color=Freq_MS_max$GS_color;
restrictGenes=colorh1==which.color
verboseScatterplot(Alldegrees1$kWithin[ restrictGenes],
(datKME[restrictGenes, paste("MM.", which.color, sep="")])^4,
col=which.color,
xlab="Intramodular Connectivity",
ylab="(Module Membership)^4")
可以看出內(nèi)部連接度與模塊身份高度相關(guān)躏啰。
我們使用3個標(biāo)準(zhǔn)來篩選樞紐基因:基因與指定模塊顯著性 > 0.2, greenyellow Module membership value > 0.8, q.weighted < 0.01
GS_spe <- as.numeric(apply(multiExpr[[1]]$data,2,function(x){
biserial.cor(x,datTraits[,trait_minP$trait[i]])
}))
GeneSignificance_spe <- abs(GS_spe)
# 基于顯著性和MM計算每個基因與 指定trait 的關(guān)聯(lián)耙册,結(jié)果包括p, q, cor, z,
NS1=networkScreening(y=allTraits[,trait_minP$trait[[i]]],
datME=datME,
datExpr=multiExpr[[1]]$data,
oddPower=3,
blockSize=1000,
minimumSampleSize=4,
addMEy=TRUE,
removeDiag=FALSE,
weightESy=0.5)
rownames(NS1) <- colnames(multiExpr[[1]]$data)
# 根據(jù) 基因與指定性狀的直接相關(guān)性(biserial.cor)给僵,模塊身份,和加權(quán)相關(guān)性 篩選基因:
FilterGenes_spe = ((GeneSignificance_spe > 0.2) & (abs(datKME[paste("MM.",Freq_MS_max$GS_color,sep="")])>0.8) & (NS1$q.Weighted < 0.01) )
table(FilterGenes_spe)
# 找到滿足上面條件的基因:
trait_hubGenes_spe <- colnames(multiExpr[[1]]$data)[FilterGenes_spe]
符合上述條件的樞紐基因有53個:
這些基因的拓?fù)渲丿B熱圖觅玻,顏色越深表示拓?fù)渲丿B度越高想际。
# hub 基因熱圖:
plotNetworkHeatmap(multiExpr[[1]]$data,
plotGenes = paste("X",trait_hubGenes_spe,sep = ""),
networkType = "unsigned",
useTOM = TRUE,
power=softPower,
main="unsigned correlations")
15. hub genes GO and KEGG analysis.
library(clusterProfiler)
library(org.Hs.eg.db)
library(ggplot2)
# GO 分析:
ego <- enrichGO(gene = trait_hubGenes_spe,
# universe = names(geneList),
OrgDb = org.Hs.eg.db,
ont = "BP",
pAdjustMethod = "BH",
pvalueCutoff = 0.01,
qvalueCutoff = 0.05,
readable = TRUE)
GO_BP <- as.data.frame(ego)
GO_BP$point_shape<-"0"
GO_BP$point_size<-"15"
# write.xlsx(GO_BP,"./results/392_genes_GO_BP.xlsx")
ggplot(data=GO_BP)+
geom_bar(aes(x=reorder(Description,Count),y=Count, fill=-log10(qvalue)), stat='identity') +
coord_flip() +
scale_fill_gradient(expression(-log["10"]("q value")),low="red", high = "blue") +
xlab("") +
ylab("Gene count") +
scale_y_continuous(expand=c(0, 0))+
theme_bw()+
theme(
axis.text.x=element_text(color="black",size=rel(1.5)),
axis.text.y=element_text(color="black", size=rel(1.6)),
axis.title.x = element_text(color="black", size=rel(1.6)),
legend.text=element_text(color="black",size=rel(1.0)),
legend.title = element_text(color="black",size=rel(1.1))
# legend.position=c(0,1),legend.justification=c(-1,0)
# legend.position="top",
)
KEGG 分析:
# KEGG:
kk <- enrichKEGG(gene = trait_hubGenes_spe,
organism = 'hsa',
pvalueCutoff = 0.05)
kegg_DF <- as.data.frame(kk)
15. 導(dǎo)出 hub genes 為 Cytoscape 和 visant 可以識別的格式并作圖:
# 導(dǎo)出整個模塊基因到 VisANT
modTOM <- TOM[mergedColors==Freq_MS_max$GS_color,mergedColors==Freq_MS_max$GS_color]
dimnames(modTOM) = list(colnames(multiExpr[[1]]$data)[mergedColors==Freq_MS_max$GS_color], colnames(multiExpr[[1]]$data)[mergedColors==Freq_MS_max$GS_color])
vis = exportNetworkToVisANT(modTOM,
file = paste("./WGCNA/ADR_drug_exp_new/VisANTInput-Mod-", Freq_MS_max$GS_color, ".txt", sep=""),
weighted = TRUE,
threshold = 0
# probeToGene = data.frame(annot$substanceBXH, annot$gene_symbol)
)
# 導(dǎo)出樞紐基因到 Cytoscape
hubGene_TOM <- TOM[FilterGenes_spe,FilterGenes_spe]
dimnames(hubGene_TOM) = list(colnames(multiExpr[[1]]$data)[FilterGenes_spe], colnames(multiExpr[[1]]$data)[FilterGenes_spe])
cyt = exportNetworkToCytoscape(hubGene_TOM,
edgeFile = paste("./WGCNA/ADR_drug_exp_new/CytoscapeInput-edges-", paste(Freq_MS_max$GS_color, collapse="-"), ".txt", sep=""),
nodeFile = paste("./WGCNA/ADR_drug_exp_new/CytoscapeInput-nodes-", paste(Freq_MS_max$GS_color, collapse="-"), ".txt", sep=""),
weighted = TRUE,
threshold = 0.02,
nodeNames = trait_hubGenes_spe,
altNodeNames = trait_hubGenes_spe,
nodeAttr = mergedColors[FilterGenes_spe]
)
更多原創(chuàng)精彩視頻敬請關(guān)注生信雜談: