? ? ? ?近年來(lái),很多SCI高分文章中都使用了WGCNA萍鲸,那么WGCNA是什么闷叉,它可以應(yīng)用于哪些研究方向,又如何進(jìn)行WGCNA分析脊阴,其分析結(jié)果如何看握侧?現(xiàn)在就帶著這些問(wèn)題,跟著小易一起學(xué)習(xí)探討吧蹬叭!
1? WGCNA介紹
? ? ? ?WGCNA(weighted gene co-expression network analysis藕咏,權(quán)重基因共表達(dá)網(wǎng)絡(luò)分析)是一種分析多個(gè)樣本基因表達(dá)模式的分析方法状知,可將表達(dá)模式相似的基因進(jìn)行聚類秽五,并分析模塊與特定性狀或表型之間的關(guān)聯(lián)關(guān)系,因此在疾病以及其他性狀與基因關(guān)聯(lián)分析等方面的研究中被廣泛應(yīng)用饥悴。
WGCNA與趨勢(shì)分析都為基因共表達(dá)分析方法坦喘,與趨勢(shì)分析相比,WGCNA 有以下優(yōu)勢(shì):
a. 聚類方法:使用權(quán)重共表達(dá)策略(無(wú)尺度分布)西设,更加符合生物學(xué)現(xiàn)象瓣铣;
b. 基因間的關(guān)系:能呈現(xiàn)基因間的相互作用關(guān)系,且能找出處于調(diào)控網(wǎng)絡(luò)中心的 hub 基因贷揽;
c. 樣本數(shù):適合于大樣本量棠笑,且越多越好。而趨勢(shì)分析 5 個(gè)點(diǎn)以上就會(huì)使結(jié)果非常復(fù)雜禽绪,準(zhǔn)確性 下降蓖救,且最多只能分析 8 個(gè)點(diǎn);
d. 與表型的關(guān)聯(lián):可以利用模塊特征值和 hub 基因與特定性狀印屁、表型進(jìn)行關(guān)聯(lián)分析循捺,更準(zhǔn)確地分 析生物學(xué)問(wèn)題。
2? 數(shù)據(jù)要求
? ? ? ?由于WGCNA是基于相關(guān)系數(shù)的表達(dá)調(diào)控網(wǎng)絡(luò)分析方法雄人。當(dāng)樣本數(shù)過(guò)低的時(shí)候从橘,相關(guān)系數(shù)的計(jì)算是不可靠的,得到的調(diào)控網(wǎng)絡(luò)價(jià)值不大。所以恰力,我們推薦的樣本數(shù)如下:
a. 當(dāng)獨(dú)立樣本數(shù)≥8(非重復(fù)樣本)時(shí)叉谜,可以考慮基于Pearson相關(guān)系數(shù)的WGCNA共表達(dá)網(wǎng)絡(luò)的方法(效果看實(shí)際情況而定);
b. 當(dāng)樣本數(shù)≥15(可以包含生物學(xué)重復(fù))時(shí)踩萎,WGCNA方法會(huì)有更好的效果正罢。
c. 當(dāng)樣品數(shù)<8時(shí),不建議進(jìn)行該項(xiàng)分析驻民。
d. 該方法對(duì)于不同材料或不同組織進(jìn)行分析更有意義翻具,對(duì)于不同時(shí)間點(diǎn)處理相同樣品意義不大。
3? 分析代碼
a. wgcna01_1.R回还,wgcna01_2.R是用來(lái)構(gòu)建網(wǎng)絡(luò)和相關(guān)圖裆泳,適用于性狀關(guān)聯(lián)和無(wú)性狀關(guān)聯(lián)的WGCNA網(wǎng)絡(luò)構(gòu)建。因?yàn)檫@兩步很費(fèi)時(shí)柠硕,就拆開(kāi)了并行跑工禾。
用法:Rscript wgcna01_1.R expre.xls;
b. wgcna02.R 是可視化基因網(wǎng)絡(luò)圖。
Rscript wgcna02.R trait.txt(optional)
代碼:wgcna01_1
# Description: To perform co‐expression analysis based on WGCNA Rpackage
# Version: 1.0, mics.com, 2018‐10‐09
# Usage: Rscript wgcna.R rpkm.xls outdir
args <‐ commandArgs(TRUE);
exp_data=args[1]
# Part 1. Basic Analysis
# 1. Data input, cleaning and pre‐processing
# 1.1 Loading expression data
# Load the package
library(WGCNA);
# The following setting is important, do not omit.
options(stringsAsFactors = FALSE);
#Read in the data set
data = read.table(exp_data, head=T, row.names=1);
multiExpr = t(data);
# 1.2 Checking data for excessive missing values
gsg = goodSamplesGenes(multiExpr, verbose = 3);
#
if (!gsg$allOK)
{
# Remove the offending genes and samples from the data:
multiExpr = multiExpr[gsg$goodSamples, gsg$goodGenes]
}
nGenes = ncol(multiExpr)
nSamples = nrow(multiExpr)
if(nGenes > 30000)
{
library(genefilter)
data = read.table(exp_data, head=T, row.names=1)
f1 <‐ pOverA(0.8,0.5)
f2 <‐ function(x) (IQR(x) > 0.5)
ff <‐ filterfun(f1, f2)
selected <‐ genefilter(data, ff)
sum(selected)
esetSub <‐ data[selected, ]
multiExpr <‐ as.data.frame(t(esetSub))
nGenes = ncol(multiExpr)
nSamples = nrow(multiExpr)
}
save(multiExpr, file = "01‐dataInput_Expr.RData")
####Plot 1 samples cluster
sampleTree = hclust(dist(multiExpr), method = "average");
sizeGrWindow(12,9)
pdf(file = "sampleClustering.pdf", width = 12, height = 9);
par(cex = 0.6);
par(mar = c(0,4,2,0))
plot(sampleTree, main = "Sample clustering to detect outliers",
sub="", xlab="", cex.lab = 1.5,cex.axis = 1.5, cex.main = 2)
dev.off()
#########################################remove outlier samples(depend
on samples totally)
# Plot a line to show the cut
#abline(h = 15, col = "red");
# Determine cluster under the line
#clust = cutreeStatic(sampleTree, cutHeight = 15, minSize = 10)
#table(clust)
# clust 1 contains the samples we want to keep.
#keepSamples = (clust==1)
#multiExpr = multiExpr[keepSamples, ]
#nGenes = ncol(multiExpr)
#nSamples = nrow(multiExpr)
# 2. Step‐by‐step network construction and module detection
#=====================================================================
=========
# 2.1 Choosing the soft‐thresholding power: analysis of network
# Choose a set of soft‐thresholding powers
powers = c(c(1:10), seq(from = 12, to=30, by=2))
# Call the network topology analysis function
sft = pickSoftThreshold(multiExpr, powerVector = powers, verbose = 5 )
save(sft, file = "01‐sftpower.RData")
####Plot 2 selected power
pdf(file ="Network_power.pdf", width = 9, height = 5);
par(mfrow = c(1,2));
cex1 = 0.9;
# Scale‐free topology fit index as a function of the soft‐thresholdingpower
plot(sft$fitIndices[,1], ‐sign(sft$fitIndices[,3])*sft$fitIndices[,2],
xlab="Soft Threshold (power)",ylab="Scale Free Topology ModelFit,signed R^2",type="n",
main = paste("Scale independence"));
text(sft$fitIndices[,1], ‐sign(sft$fitIndices[,3])*sft$fitIndices[,2],
labels=powers,cex=cex1,col="red");
abline(h=0.85,col="red")
# Mean connectivity as a function of the soft‐thresholding power
plot(sft$fitIndices[,1], sft$fitIndices[,5],
xlab="Soft Threshold (power)",ylab="Mean Connectivity", type="n",
main = paste("Mean connectivity"))
text(sft$fitIndices[,1], sft$fitIndices[,5], labels=powers,
cex=cex1,col="red")
dev.off()
# Select power
softPower <‐ sft$powerEstimate
####Plot 3 檢驗(yàn)選定的β值下記憶網(wǎng)絡(luò)是否逼近 scale free
# 基因多的時(shí)候使用下面的代碼:
k <‐ softConnectivity(datE=multiExpr,power=softPower)
pdf(file = "Check_Scalefree.pdf", width = 10, height = 5);
par(mfrow=c(1,2))
hist(k)
scaleFreePlot(k,main="Check Scale free topology\n")
#‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐
‐‐‐‐‐‐‐‐‐
# 2.2 One‐step network construction and module detection
#‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐
‐‐‐‐‐‐‐‐‐
#2.2.1 獲得臨近矩陣,根據(jù)β值獲得臨近矩陣和拓?fù)渚仃嚮热幔?/p>
softPower <‐ sft$powerEstimate
adjacency = adjacency(multiExpr, power = softPower);
# 將臨近矩陣轉(zhuǎn)為 Tom 矩陣
TOM = TOMsimilarity(adjacency);
# 計(jì)算基因之間的相異度
dissTOM = 1‐TOM
save(TOM, file = "TOMsimilarity_adjacency.RData")
#2.2.2a Clustering using TOM
geneTree = hclust(as.dist(dissTOM),method="average");
minModuleSize = 30
dynamicMods = cutreeDynamic(dendro = geneTree, distM = dissTOM,
minClusterSize = minModuleSize);
table(dynamicMods)
dynamicColors = labels2colors(dynamicMods)
table(dynamicColors)
#Plot 4 plot the dendrogram and colors underneath
pdf(file = "GeneDendrogramColors.pdf", width = 12, height = 9);
sizeGrWindow(12,9)
plotDendroAndColors(geneTree, dynamicColors,
"Dynamic Tree Cut",
dendroLabels = FALSE, hang = 0.03,
addGuide = TRUE, guideHang = 0.05,
main = "Gene dendrogram and module colors")
dev.off()
#2.2.2b Merging of modules whose expression profiles are very similar
MEList = moduleEigengenes(multiExpr, colors = dynamicColors)
MEs = MEList$eigengenes
MEDiss = 1‐cor(MEs);
METree = hclust(as.dist(MEDiss), method = "average");
#Plot 5 merge the dendrogram and colors underneath for similar module
pdf(file = "Clustering_similar_module.pdf", width = 12, height = 9);
sizeGrWindow(12, 9)
plot(METree, main = "Clustering of similar module eigengenes",
xlab = "", sub = "")
MEDissThres = 0.25
abline(h=MEDissThres, col = "red")
dev.off()
merge = mergeCloseModules(multiExpr, dynamicColors, cutHeight =
MEDissThres, verbose = 3)
mergedColors = merge$colors;
mergedMEs = merge$newMEs;
#Plot 6 plot the gene dendrogram again, with the original and merged
module colors underneath
pdf(file = "Merged_GeneDendrogramColors.pdf", width = 12, height = 9)
sizeGrWindow(12, 9)
plotDendroAndColors(geneTree, cbind(dynamicColors, mergedColors), c("Dynamic Tree Cut", "Merged dynamic"),
dendroLabels = FALSE, hang = 0.03, addGuide = TRUE, guideHang = 0.05)
dev.off()
moduleColors = mergedColors
colorOrder = c("grey", standardColors(50));
moduleLabels = match(moduleColors, colorOrder)‐1;
MEs = mergedMEs;
save(MEs, moduleLabels, moduleColors, geneTree, file = "02‐networkConstruction‐stepByStep.RData")
代碼:wgcna01_2
args <‐ commandArgs(TRUE);
exp_data=args[1]
# Part 1.1 Tom Network visualization
# 1. Data input, cleaning and pre‐processing
# 1.1 Loading expression data
# Load the package
library(WGCNA);
options(stringsAsFactors = FALSE)
enableWGCNAThreads()
# The following setting is important, do not omit.
options(stringsAsFactors = FALSE);
#Read in the data set
data = read.table(exp_data, head=T, row.names=1);
multiExpr = t(data)
# 1.2 Checking data for excessive missing values
gsg = goodSamplesGenes(multiExpr, verbose = 3);
#
if (!gsg$allOK)
{
# Remove the offending genes and samples from the data:
multiExpr = multiExpr[gsg$goodSamples, gsg$goodGenes]
}
nGenes = ncol(multiExpr)
nSamples = nrow(multiExpr)
if(nGenes > 30000)
{
library(genefilter)
data = read.table(exp_data, head=T, row.names=1)
f1 <‐ pOverA(0.8,0.5)
f2 <‐ function(x) (IQR(x) > 0.5)
ff <‐ filterfun(f1, f2)
selected <‐ genefilter(data, ff)
sum(selected)
esetSub <‐ data[selected, ]
multiExpr <‐ as.data.frame(t(esetSub))
nGenes = ncol(multiExpr)
nSamples = nrow(multiExpr)
}
# 1.3 Choosing the soft‐thresholding power: analysis of network
topology
# Choose a set of soft‐thresholding powers
powers = c(c(1:10), seq(from = 12, to=30, by=2))
# Call the network topology analysis function
sft = pickSoftThreshold(multiExpr, powerVector = powers, verbose = 5 )
softPower <‐ sft$powerEstimate
# 1.4 Visualizing the gene network
# Calculate topological overlap anew: this could be done more
efficiently by saving the TOM
TOM = TOMsimilarityFromExpr(multiExpr, power = softPower);
save(TOM, file = "TOMsimilarityFromExpr_multiExpr.power.RData")
dissTOM = 1‐TOM
nSelect = 500
# For reproducibility, we set the random seed
set.seed(10);
select = sample(nGenes, size = nSelect);
selectTOM = dissTOM[select, select];
selectTree = hclust(as.dist(selectTOM), method = "average")
selectColors = moduleColors[select];
#Plot 1 Visualizing the gene network in all modules for Tom matrix
pdf(file="Tom_GeneNetworkHeatmap.pdf", width = 9, height = 9);
sizeGrWindow(9,9)
plotDiss = selectTOM^7; diag(plotDiss) = NA;
TOMplot(plotDiss, selectTree, selectColors, main = "Gene Network
heatmap plot")
dev.off()?
代碼:wgcna02
library(WGCNA)
options(stringsAsFactors = FALSE)
enableWGCNAThreads()
lname1= load("01‐dataInput_Expr.RData")
lname2=load("01‐sftpower.RData")
lname3=load("02‐networkConstruction‐stepByStep.RData")
args <‐ commandArgs(TRUE);
trait_data=args[1]
#‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐
‐‐‐‐‐‐‐‐‐
# 3.2 Visualizing the network of module eigengenes
#‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐
‐‐‐‐‐‐‐‐‐
# Recalculate module eigengenes
MEs = moduleEigengenes(multiExpr, moduleColors)$eigengenes
MET = orderMEs(MEs)
#Plot 3.2.a plot the relationships among the module eigengenes
pdf(file="Module_Eigengene_network.pdf", width = 9, height = 9);
par(mfrow=c(1,2))
plotEigengeneNetworks(MET, "Eigengene dendrogram", marDendro =
c(0,4,2,0),plotHeatmaps = FALSE)
par(cex = 1.0)
plotEigengeneNetworks(MET, "Eigengene adjacency heatmap", marHeatmap =
c(3,4,2,2),
plotDendrograms = FALSE, xLabelsAngle = 90)
dev.off()
#Plot 3.2.b plot MEs corelation between different module
library(PerformanceAnalytics)
Matrix_MEs <‐ as.matrix(MEs)
pdf(file="Correlation_modules.pdf",width = 9, height = 9)
chart.Correlation(Matrix_MEs,histogram = TRUE,pch=19)
dev.off()
#=====================================================================
=========
# 4.Find all genes for each module and Exporting to Cytoscape
#=====================================================================
=========
# Recalculate topological overlap if needed
# Select modules
Allmodules = unique(moduleColors);
multiExpr <‐ as.data.frame(multiExpr)
probes = names(multiExpr) ###"probes means gene names"
# Select module probes
for (i in Allmodules){
module=i
inModule = is.finite(match(moduleColors, module));
modProbes = probes[inModule];
#4.1 Write all gene from each modules
write.table(modProbes,paste("allgenefrom_module",module,".txt",
sep=""),quote = FALSE,row.names = F,col.names = FALSE,sep="\t")
#4.2 Select the corresponding Topological Overlap for Cytoscape
input files
modTOM = TOM[inModule, inModule];
dimnames(modTOM) = list(modProbes, modProbes)
# Export the network into edge and node list files Cytoscape can
read
cyt = exportNetworkToCytoscape(modTOM,
edgeFile = paste("CytoscapeInput‐edges‐", module, ".txt",sep=""),
nodeFile = paste("CytoscapeInput‐nodes‐", module, ".txt",sep=""),
weighted = TRUE,
threshold = 0.02,
nodeNames = modProbes,
altNodeNames = modProbes,
nodeAttr = moduleColors[inModule])
}
#=====================================================================
=========
# 5. Find the hub gene for all modules
#=====================================================================
=========
hubs <‐ chooseTopHubInEachModule(multiExpr, moduleColors,
omitColors = "grey",
power = 14,
type = "unsigned")
H <‐ as.matrix(hubs)
write.table(H,"hubGenes_foreachModule.txt",quote = FALSE,row.names =
T,col.names = FALSE,sep="\t")
# Part 6. External Analysis
if(file.exists(trait_data)){
#=====================================================================
=========
# 6.1 Loading trait data
#=====================================================================
=========
traitData = read.table(trait_data);
# Form a data frame analogous to expression data that will hold the
clinical traits.
nGenes =ncol(multiExpr);
nSamples = nrow(multiExpr);
sampleName=rownames(multiExpr)
traitRows = match(sampleName,traitData$Sample) #colum name for all
sample must be called "Sample"
datTraits = traitData[traitRows, ‐1];
rownames(datTraits) = traitData[traitRows, 1];
save(datTraits,file="01‐dataInput_Trait.RData")
collectGarbage();
# 6.2 Quantifying module–trait associations
MEs0 = moduleEigengenes(multiExpr, moduleColors)$eigengenes
MEs = orderMEs(MEs0)
moduleTraitCor = cor(MEs, datTraits, use = "p");
moduleTraitPvalue = corPvalueStudent(moduleTraitCor, nSamples);
####Plot1 Module?trait relationships in heatmap plot
sizeGrWindow(10,6)
pdf(file = "Module_trait_relationships.pdf", width = 12, height = 9);
#Will display correlations and their p‐values
textMatrix = paste(signif(moduleTraitCor, 2), "\n(",
signif(moduleTraitPvalue, 1), ")", sep ="");
dim(textMatrix) = dim(moduleTraitCor)
par(mar = c(6, 8.5, 3, 3));
#Display the correlation values within a heatmap plot
labeledHeatmap(Matrix = moduleTraitCor,
xLabels = names(datTraits),
yLabels = names(MEs),
ySymbols = names(MEs),
colorLabels = FALSE,
colors = greenWhiteRed(50),
textMatrix = textMatrix,
setStdMargins = FALSE,
cex.text = 0.5,
zlim = c(‐1,1),
main = paste("Module‐trait relationships"))
dev.off()
}
4??結(jié)果說(shuō)明
4.1 GeneDendrogramColors.pdf:基因進(jìn)化樹(shù)(基因共表達(dá)模塊分析)
? ? ? ?橫坐標(biāo):基因共表達(dá)模塊分析的結(jié)果展示闻葵,不同的顏色代表不同的基因模塊,灰色屬于未知模塊的基因
? ? ? ?縱坐標(biāo):基因間的不相似性系數(shù)癣丧,進(jìn)化樹(shù)的每一個(gè)樹(shù)枝代表一個(gè)基因
4.2 NetworkHeatmap.pdf :共表達(dá)模塊基因相關(guān)系數(shù)熱圖
? ? ? ?圖上和圖左是基因系統(tǒng)發(fā)育樹(shù)槽畔,每一個(gè)樹(shù)枝代表一個(gè)基因,不同顏色亮帶表示不同的module胁编,每一個(gè)亮點(diǎn)表示每個(gè)基因與其他基因的相關(guān)性強(qiáng)弱厢钧,每個(gè)點(diǎn)的顏色越深(白→黃→紅)代表行和列對(duì)應(yīng)的兩個(gè)基因間的連通性越強(qiáng)。P值采用student's t test計(jì)算所得嬉橙,P值越小代表基因與模塊相關(guān)性的顯著性越強(qiáng)早直。
4.3 EigengeneDendrogramHeatmap.pdf :共表達(dá)模塊的聚類圖和熱圖
? ? ? ?聚類圖:對(duì)每個(gè)基因模塊的特征向量基因進(jìn)行聚類,
? ? ? ?熱圖:我們將所有模塊兩兩間的相關(guān)性進(jìn)行分析,并繪制熱圖市框,顏色越紅霞扬,相關(guān)性越高
4.4 ModuleSampleRelation.pdf :?性狀相關(guān)特征模塊分析(此分析需結(jié)合生理數(shù)據(jù),如沒(méi)有生理數(shù)據(jù)則沒(méi)有此結(jié)果)
? ? ? ?每一列代表一個(gè)性狀枫振,每一行代表一個(gè)基因模塊喻圃。每個(gè)格子里的數(shù)字代表模塊與性狀的相關(guān)性,該數(shù)值越接近1蒋得,表示模塊與性狀正相關(guān)性越強(qiáng)级及;越接近-1,表示模塊與性狀負(fù)相關(guān)性越強(qiáng)额衙。括號(hào)里的數(shù)字代表顯著性P value饮焦,該數(shù)值越小怕吴,表示顯著性越強(qiáng)。P值采用student's t test計(jì)算所得县踢,P值越小代表性狀與模塊相關(guān)性的顯著性越強(qiáng)转绷。