一戴而、前言
1.1 概念
加權(quán)基因共表達(dá)網(wǎng)絡(luò)分析 (WGCNA, Weighted correlation network analysis)是用來描述不同樣品之間基因關(guān)聯(lián)模式的系統(tǒng)生物學(xué)方法,可以用來鑒定高度協(xié)同變化的基因集, 并根據(jù)基因集的內(nèi)連性和基因集與表型之間的關(guān)聯(lián)鑒定候補(bǔ)生物標(biāo)記基因或治療靶點(diǎn)玖姑。
該分析方法旨在尋找協(xié)同表達(dá)的基因模塊(module)莉钙,并探索基因網(wǎng)絡(luò)與關(guān)注的表型之間的關(guān)聯(lián)關(guān)系啥刻,以及網(wǎng)絡(luò)中的核心基因刺下。
適用于復(fù)雜的數(shù)據(jù)模式共郭,推薦5組(或者15個(gè)樣品)以上的數(shù)據(jù)树埠。一般可應(yīng)用的研究方向有:不同器官或組織類型發(fā)育調(diào)控糠馆、同一組織不同發(fā)育調(diào)控、非生物脅迫不同時(shí)間點(diǎn)應(yīng)答怎憋、病原菌侵染后不同時(shí)間點(diǎn)應(yīng)答又碌。
2.2 原理
從方法上來講,WGCNA分為表達(dá)量聚類分析和表型關(guān)聯(lián)兩部分绊袋,主要包括基因之間相關(guān)系數(shù)計(jì)算毕匀、基因模塊的確定、共表達(dá)網(wǎng)絡(luò)癌别、模塊與性狀關(guān)聯(lián)四個(gè)步驟皂岔。
第一步計(jì)算任意兩個(gè)基因之間的相關(guān)系數(shù)(Person Coefficient)。為了衡量?jī)蓚€(gè)基因是否具有相似表達(dá)模式展姐,一般需要設(shè)置閾值來篩選躁垛,高于閾值的則認(rèn)為是相似的。但是這樣如果將閾值設(shè)為0.8圾笨,那么很難說明0.8和0.79兩個(gè)是有顯著差別的教馆。因此,WGCNA分析時(shí)采用相關(guān)系數(shù)加權(quán)值擂达,即對(duì)基因相關(guān)系數(shù)取N次冪土铺,使得網(wǎng)絡(luò)中的基因之間的連接服從無尺度網(wǎng)絡(luò)分布(scale-freenetworks),這種算法更具生物學(xué)意義板鬓。
第二步通過基因之間的相關(guān)系數(shù)構(gòu)建分層聚類樹悲敷,聚類樹的不同分支代表不同的基因模塊,不同顏色代表不同的模塊穗熬《朴兀基于基因的加權(quán)相關(guān)系數(shù),將基因按照表達(dá)模式進(jìn)行分類唤蔗,將模式相似的基因歸為一個(gè)模塊探遵。這樣就可以將幾萬個(gè)基因通過基因表達(dá)模式被分成了幾十個(gè)模塊,是一個(gè)提取歸納信息的過程妓柜。
?共表達(dá)網(wǎng)絡(luò):定義為加權(quán)基因網(wǎng)絡(luò)箱季。點(diǎn)代表基因,邊代表基因表達(dá)相關(guān)性棍掐。加權(quán)是指對(duì)相關(guān)性值進(jìn)行冥次運(yùn)算 (冥次的值也就是軟閾值 (power, pickSoftThreshold這個(gè)函數(shù)所做的就是確定合適的power))藏雏。無向網(wǎng)絡(luò)的邊屬性計(jì)算方式為 abs(cor(genex, geney)) ^ power;有向網(wǎng)絡(luò)的邊屬性計(jì)算方式為 (1+cor(genex, geney)/2) ^ power; sign hybrid的邊屬性計(jì)算方式為cor(genex, geney)^power if cor>0 else 0
?Module(模塊):高度內(nèi)連的基因集作煌。在無向網(wǎng)絡(luò)中掘殴,模塊內(nèi)是高度相關(guān)的基因赚瘦。在有向網(wǎng)絡(luò)中,模塊內(nèi)是高度正相關(guān)的基因奏寨。把基因聚類成模塊后起意,可以對(duì)每個(gè)模塊進(jìn)行三個(gè)層次的分析:1. 功能富集分析查看其功能特征是否與研究目的相符;2. 模塊與性狀進(jìn)行關(guān)聯(lián)分析病瞳,找出與關(guān)注性狀相關(guān)度最高的模塊揽咕;3. 模塊與樣本進(jìn)行關(guān)聯(lián)分析,找到樣品特異高表達(dá)的模塊套菜。
?Eigengene (eigen + gene):基因和樣本構(gòu)成的矩陣亲善,https://en.wiktionary.org/wiki/eigengene
?Connectivity (連接度):類似于網(wǎng)絡(luò)中 "度" (degree)的概念。每個(gè)基因的連接度是與其相連的基因的邊屬性之和逗柴。
?Module eigengene E: 給定模型的第一主成分蛹头,代表整個(gè)模型的基因表達(dá)譜。這個(gè)是個(gè)很巧妙的梳理嚎于,我們之前講過PCA分析的降維作用掘而,之前主要是拿來做可視化,現(xiàn)在用到這個(gè)地方于购,很好的用一個(gè)向量代替了一個(gè)矩陣,方便后期計(jì)算知染。(降維除了PCA肋僧,還可以看看tSNE)
?Intramodular connectivity: 給定基因與給定模型內(nèi)其他基因的關(guān)聯(lián)度,判斷基因所屬關(guān)系控淡。
?Module membership: 給定基因表達(dá)譜與給定模型的eigengene的相關(guān)性嫌吠。
?Hub gene: 關(guān)鍵基因 (連接度最多或連接多個(gè)模塊的基因)。
?Adjacency matrix (鄰接矩陣):基因和基因之間的加權(quán)相關(guān)性值構(gòu)成的矩陣掺炭。
?TOM (Topological overlap matrix):把鄰接矩陣轉(zhuǎn)換為拓?fù)渲丿B矩陣辫诅,以降低噪音和假相關(guān),獲得的新距離矩陣涧狮,這個(gè)信息可拿來構(gòu)建網(wǎng)絡(luò)或繪制TOM圖炕矮。
以上解釋來自:WGCNA分析,簡(jiǎn)單全面的最新教程者冤;http://www.reibang.com/p/e9cc3f43441d
分析案例
--
NCBI搜索結(jié)果:WGCNA - Search Results - PubMed (nih.gov)
WGCNA分析肤视,如果你的數(shù)據(jù)樣本較大,建議在服務(wù)器中進(jìn)行運(yùn)行I娣恪邢滑!
二、WGCNA 代碼教程
2.1 加載所需的R包
加載WGCNAR包
#install.packages("WGCNA")
#BiocManager::install('WGCNA')
library(WGCNA)
進(jìn)行基礎(chǔ)設(shè)置愿汰,(此步驟可以省略)
options(stringsAsFactors = FALSE)
enableWGCNAThreads() ## 打開多線程
2.2 加載數(shù)據(jù)矩陣
WGCNA.fpkm = read.table("WGCNA.inputdata.txt",header=T,
comment.char = "",
check.names=F)
數(shù)據(jù)矩陣格式
> WGCNA.fpkm[1:10,1:10]
sample MT_0_1 MT_0_2 MT_0_3 MT_3_1 MT_3_2 MT_3_3 MT_6_1 MT_6_2 MT_6_3
1 gene10 0 0.000000 0.000000 0.000000 0.000000 0.000000 0.225467 0.000000 0.116728
2 gene12 0 0.000000 0.000000 0.000000 0.140796 0.000000 0.051667 0.076227 0.109181
3 gene13 0 0.000000 0.000000 0.000000 0.000000 0.000000 0.090672 0.000000 0.000000
4 gene14 0 0.000000 0.247270 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
5 gene16 0 0.000000 0.213587 0.051469 0.433966 0.786561 0.000000 0.246451 0.972179
6 gene19 0 0.113876 0.060553 0.000000 0.331847 0.000000 0.099846 0.224528 0.191205
7 gene20 0 0.000000 0.000000 0.183355 0.531899 0.383033 0.000000 2.460091 2.441143
8 gene21 0 0.470667 0.000000 1.016470 0.583894 0.000000 1.074058 2.020643 1.007870
9 gene24 0 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
10 gene25 0 0.054502 0.026962 0.026076 0.025938 0.000000 0.028589 0.023683 0.000000
查看數(shù)據(jù)名稱
names(WGCNA.fpkm)
[1] "sample" "MT_0_1" "MT_0_2" "MT_0_3" "MT_3_1" "MT_3_2" "MT_3_3" "MT_6_1" "MT_6_2" "MT_6_3" "MT_12_1" "MT_12_2"
[13] "MT_12_3" "HG_0_1" "HG_0_2" "HG_0_3" "HG_3_1" "HG_3_2" "HG_3_3" "HG_6_1" "HG_6_2" "HG_6_3" "HG_12_1" "HG_12_2"
[25] "HG_12_3"
對(duì)數(shù)據(jù)進(jìn)行提取
names(WGCNA.fpkm)
datExpr0 = as.data.frame(t(WGCNA.fpkm[,-1]))
names(datExpr0) = WGCNA.fpkm$sample;
rownames(datExpr0) = names(WGCNA.fpkm[,-1])
check missing value
gsg = goodSamplesGenes(datExpr0, verbose = 3)
gsg$allOK
if (!gsg$allOK)
{
# Optionally, print the gene and sample names that were removed:
if (sum(!gsg$goodGenes)>0)
printFlush(paste("Removing genes:", paste(names(datExpr0)[!gsg$goodGenes], collapse = ", ")))
if (sum(!gsg$goodSamples)>0)
printFlush(paste("Removing samples:", paste(rownames(datExpr0)[!gsg$goodSamples], collapse = ", ")))
# Remove the offending genes and samples from the data:
datExpr0 = datExpr0[gsg$goodSamples, gsg$goodGenes]
}
filter
meanFPKM= 0.5 ####--過濾標(biāo)準(zhǔn)困后,可以修改
n=nrow(datExpr0)
datExpr0[n+1,]=apply(datExpr0[c(1:nrow(datExpr0)),],2,mean)
datExpr0=datExpr0[1:n,datExpr0[n+1,] > meanFPKM]
# for meanFpkm in row n+1 and it must be above what you set--select meanFpkm>opt$meanFpkm(by rp)
filtered_fpkm=t(datExpr0)
filtered_fpkm=data.frame(rownames(filtered_fpkm),filtered_fpkm)
names(filtered_fpkm)[1]="sample"
head(filtered_fpkm)
write.table(filtered_fpkm, file="mRNA.symbol.uniq.filter.txt",
row.names=F, col.names=T,quote=FALSE,sep="\t")
2.3 對(duì)input data進(jìn)行聚類
#############Sample cluster###########
sampleTree = hclust(dist(datExpr0), method = "average")
pdf(file = "1.sampleClustering.pdf", width = 15, height = 8)
par(cex = 0.6)
par(mar = c(0,6,6,0))
plot(sampleTree, main = "Sample clustering to detect outliers", sub="", xlab="", cex.lab = 2,
cex.axis = 1.5, cex.main = 2)
### Plot a line to show the cut
#abline(h = 180, col = "red")##剪切高度不確定乐纸,故無紅線
dev.off()
如果數(shù)據(jù)中有個(gè)別數(shù)據(jù)與其他數(shù)據(jù)有較大的差異性,為了讓后續(xù)分析更精準(zhǔn)摇予,可以將其進(jìn)行適當(dāng)?shù)膭h除即可锯仪。如上圖所示.
2.4 加載性狀數(shù)據(jù)
allTraits = read.table("type.txt",row.names=1,header=T,comment.char = "",check.names=F)
dim(allTraits)
names(allTraits)
head(allTraits)
> head(allTraits)
Cold_3h Cold_6h Cold_12
MT_0_1 0 0 0
MT_0_2 0 0 0
MT_0_3 0 0 0
MT_3_1 3 0 0
MT_3_2 3 0 0
MT_3_3 3 0 0
表達(dá)量與性狀數(shù)據(jù)進(jìn)行匹配
fpkmSamples = rownames(datExpr0)
traitSamples =rownames(allTraits)
traitRows = match(fpkmSamples, traitSamples)
datTraits = allTraits[traitRows,]
rownames(datTraits)
collectGarbage()
pdf(file="2.Sample_dendrogram_and_trait_heatmap.pdf",width=20,height=12)
plotDendroAndColors(sampleTree2, traitColors,
groupLabels = names(datTraits),
main = "Sample dendrogram and trait heatmap",cex.colorLabels = 1.5, cex.dendroLabels = 1, cex.rowText = 2)
dev.off()
2.5 選擇最優(yōu)的sftpower值
# 選擇最優(yōu)的sftpower值
# softPower = sft$powerEstimate
# NA
softPower = 9
經(jīng)驗(yàn)power (無滿足條件的power時(shí)選用)
# 無向網(wǎng)絡(luò)在power小于15或有向網(wǎng)絡(luò)power小于30內(nèi),沒有一個(gè)power值可以使
# 無標(biāo)度網(wǎng)絡(luò)圖譜結(jié)構(gòu)R^2達(dá)到0.8趾盐,平均連接度較高如在100以上庶喜,可能是由于
# 部分樣品與其他樣品差別太大。這可能由批次效應(yīng)救鲤、樣品異質(zhì)性或?qū)嶒?yàn)條件對(duì)
# 表達(dá)影響太大等造成久窟。可以通過繪制樣品聚類查看分組信息和有無異常樣品本缠。
# 如果這確實(shí)是由有意義的生物變化 引起的斥扛,也可以使用下面的經(jīng)驗(yàn)power值。
if (is.na(power)){
power = ifelse(nSamples<20, ifelse(type == "unsigned", 9, 18),
ifelse(nSamples<30, ifelse(type == "unsigned", 8, 16),
ifelse(nSamples<40, ifelse(type == "unsigned", 7, 14),
ifelse(type == "unsigned", 6, 12))
)
)
}
2.6 網(wǎng)絡(luò)構(gòu)建
adjacency = adjacency(datExpr0, power = softPower)
TOM = TOMsimilarity(adjacency);
dissTOM = 1-TOM
# Call the hierarchical clustering function
geneTree = hclust(as.dist(dissTOM), method = "average");
# Plot the resulting clustering tree (dendrogram)
#sizeGrWindow(12,9)
pdf(file="4_Gene clustering on TOM-based dissimilarity.pdf",width=24,height=18)
plot(geneTree, xlab="", sub="", main = "Gene clustering on TOM-based dissimilarity",
labels = FALSE, hang = 0.04)
dev.off()
層級(jí)聚類樹展示各個(gè)模塊
# 選擇模塊的數(shù)量丹锹,可選大小
minModuleSize = 30
# Module identification using dynamic tree cut:
dynamicMods = cutreeDynamic(dendro = geneTree, distM = dissTOM,
deepSplit = 2, pamRespectsDendro = FALSE,
minClusterSize = minModuleSize);
table(dynamicMods)
# Convert numeric lables into colors
dynamicColors = labels2colors(dynamicMods)
table(dynamicColors)
# Plot the dendrogram and colors underneath
#sizeGrWindow(8,6)
pdf(file="5_Dynamic Tree Cut.pdf",width=8,height=6)
plotDendroAndColors(geneTree, dynamicColors, "Dynamic Tree Cut",
dendroLabels = FALSE, hang = 0.03,
addGuide = TRUE, guideHang = 0.05,
main = "Gene dendrogram and module colors")
dev.off()
模塊間進(jìn)行矯正合并
計(jì)算模塊合并的高度
# Calculate eigengenes
MEList = moduleEigengenes(datExpr0, colors = dynamicColors)
MEs = MEList$eigengenes
# Calculate dissimilarity of module eigengenes
MEDiss = 1-cor(MEs);
# Cluster module eigengenes
METree = hclust(as.dist(MEDiss), method = "average")
# Plot the result
#sizeGrWindow(7, 6)
pdf(file="6_Clustering of module eigengenes.pdf",width=7,height=6)
plot(METree, main = "Clustering of module eigengenes",
xlab = "", sub = "")
MEDissThres = 0.4######剪切高度可修改
# Plot the cut line into the dendrogram
abline(h=MEDissThres, col = "red")
dev.off()
重新繪制合并后的模塊層次聚類圖
# Call an automatic merging function
merge = mergeCloseModules(datExpr0, dynamicColors, cutHeight = MEDissThres, verbose = 3)
# The merged module colors
mergedColors = merge$colors
# Eigengenes of the new merged modules:
mergedMEs = merge$newMEs
table(mergedColors)
#sizeGrWindow(12, 9)
pdf(file="7_merged dynamic.pdf", width = 9, height = 6)
plotDendroAndColors(geneTree, cbind(dynamicColors, mergedColors),
c("Dynamic Tree Cut", "Merged dynamic"),
dendroLabels = FALSE, hang = 0.03,
addGuide = TRUE, guideHang = 0.05)
dev.off()
2.7 各個(gè)模塊與數(shù)量性狀間的相關(guān)性計(jì)算
## 模塊與數(shù)量性狀間的相關(guān)性
#
moduleColors = mergedColors
# Construct numerical labels corresponding to the colors
colorOrder = c("grey", standardColors(50))
moduleLabels = match(moduleColors, colorOrder)-1
MEs = mergedMEs
nGenes = ncol(datExpr0)
nSamples = nrow(datExpr0)
moduleTraitCor = cor(MEs, datTraits, use = "p")
moduleTraitPvalue = corPvalueStudent(moduleTraitCor, nSamples)
pdf(file="8_Module-trait relationships.pdf",width=10,height=10)
textMatrix = paste(signif(moduleTraitCor, 2), "\n(",
signif(moduleTraitPvalue, 1), ")", sep = "")
dim(textMatrix) = dim(moduleTraitCor)
par(mar = c(6, 8.5, 3, 3))
labeledHeatmap(Matrix = moduleTraitCor,
xLabels = names(datTraits),
yLabels = names(MEs),
ySymbols = names(MEs),
colorLabels = FALSE,
colors = greenWhiteRed(50),
textMatrix = textMatrix,
setStdMargins = FALSE,
cex.text = 0.5,
zlim = c(-1,1),
main = paste("Module-trait relationships"))
dev.off()
2.8 可視化基因網(wǎng)絡(luò) (TOM plot)
數(shù)據(jù)處理
nGenes = ncol(datExpr0)
nSamples = nrow(datExpr0)
nSelect = 400
# For reproducibility, we set the random seed
set.seed(10)
select = sample(nGenes, size = nSelect)
selectTOM = dissTOM[select, select]
#
selectTree = hclust(as.dist(selectTOM), method = "average")
selectColors = moduleColors[select]
plotDiss = selectTOM^7
diag(plotDiss) = NA
library("gplots")
pdf(file="Network heatmap plot_selected genes.pdf",width=9, height=9)
mycol = colorpanel(250,'red','orange','lemonchiffon')
TOMplot(plotDiss, selectTree, selectColors, col=mycol ,main = "Network heatmap plot, selected genes")
dev.off()
2.9 繪制模塊之間相關(guān)性圖
pdf(file="Eigengene dendrogram and Eigengene adjacency heatmap.pdf", width=5, height=7.5)
par(cex = 0.9)
plotEigengeneNetworks(MEs, "", marDendro = c(0,4,1,2), marHeatmap = c(3,4,1,2), cex.lab = 0.8, xLabelsAngle= 90)
dev.off()
pdf(file="Eigengene dendrogram_2.pdf",width=6, height=6)
par(cex = 1.0)
plotEigengeneNetworks(MEs, "Eigengene dendrogram", marDendro = c(0,4,2,0), plotHeatmaps = FALSE)
dev.off()
pdf(file="Eigengene adjacency heatmap_2.pdf",width=6, height=6)
par(cex = 1.0)
plotEigengeneNetworks(MEs, "Eigengene adjacency heatmap", marHeatmap = c(3,4,2,2), plotDendrograms = FALSE, xLabelsAngle = 90)
dev.off()
參考:
1.一文看懂WGCNA 分析(2019更新版)
2.WGCNA分析稀颁,簡(jiǎn)單全面的最新教程
“小杜的生信筆記”公眾號(hào)、知乎楣黍、簡(jiǎn)書平臺(tái)匾灶,主要發(fā)表或收錄生物信息學(xué)的教程,以及基于R的分析和可視化(包括數(shù)據(jù)分析租漂,圖形繪制等)阶女;分享感興趣的文獻(xiàn)和學(xué)習(xí)資料!!