前言
通過學習前面幾個模塊癌淮,我們已經(jīng)發(fā)現(xiàn)了基于多組學數(shù)據(jù)找出的乳腺癌各亞型間具有非常顯著的分子差異髓需,而我們?nèi)绻肷钊胪诰蚱浔澈髾C制,就需要找出差異表達的基因是哪些,以及這些基因具有什么樣的功能亏推。而MOVICS第三個模塊颜懊,也就是最后一個模塊,將會幫助我們實現(xiàn)這個目的避诽。
這一模塊的內(nèi)容需要有一定生物學背景知識才能較好的理解龟虎,大家可以通過生信往期的一些推文來進行學習。
主要函數(shù)
這一部分的函數(shù)主要是行駛差異分析以及富集分析的功能沙庐,MOVICS內(nèi)置了多種常用分析的函數(shù)鲤妥,大家使用時可以根據(jù)需要進行選擇佳吞。
runDEA(): run differential expression analysis with three popular methods for choosing, including edgeR, DESeq2, and limma runMarker(): run biomarker identification to determine uniquely and significantly differential expressed genes for each subtype
runGSEA(): run gene set enrichment analysis (GSEA), calculate activity of functional pathways and generate a pathway-specific heatmap
runGSVA(): run gene set variation analysis to calculate enrichment score of each sample based on given gene set list of interest
runNTP(): run nearest template prediction based on identified biomarkers to evaluate subtypes in external cohorts
runPAM(): run partition around medoids classifier based on discovery cohort to predict subtypes in external cohorts
runKappa(): run consistency evaluation using Kappa statistics between two appraisements that identify or predict current subtypes
代碼演示
下面Immugent將進行實操演示:
# run DEA with edgeR
runDEA(dea.method = "edger",
expr = count, # raw count data
moic.res = cmoic.brca,
prefix = "TCGA-BRCA") # prefix of figure name
#> --all samples matched.
#> --you choose edger and please make sure an RNA-Seq count data was provided.
#> edger of CS1_vs_Others done...
#> edger of CS2_vs_Others done...
#> edger of CS3_vs_Others done...
#> edger of CS4_vs_Others done...
#> edger of CS5_vs_Others done...
# run DEA with DESeq2
runDEA(dea.method = "deseq2",
expr = count,
moic.res = cmoic.brca,
prefix = "TCGA-BRCA")
#> --all samples matched.
#> --you choose deseq2 and please make sure an RNA-Seq count data was provided.
#> deseq2 of CS1_vs_Others done...
#> deseq2 of CS2_vs_Others done...
#> deseq2 of CS3_vs_Others done...
#> deseq2 of CS4_vs_Others done...
#> deseq2 of CS5_vs_Others done...
# run DEA with limma
runDEA(dea.method = "limma",
expr = fpkm, # normalized expression data
moic.res = cmoic.brca,
prefix = "TCGA-BRCA")
算出差異基因后,做個熱圖來看一下棉安。
# choose edgeR result to identify subtype-specific up-regulated biomarkers
marker.up <- runMarker(moic.res = cmoic.brca,
dea.method = "edger", # name of DEA method
prefix = "TCGA-BRCA", # MUST be the same of argument in runDEA()
dat.path = getwd(), # path of DEA files
res.path = getwd(), # path to save marker files
p.cutoff = 0.05, # p cutoff to identify significant DEGs
p.adj.cutoff = 0.05, # padj cutoff to identify significant DEGs
dirct = "up", # direction of dysregulation in expression
n.marker = 100, # number of biomarkers for each subtype
doplot = TRUE, # generate diagonal heatmap
norm.expr = fpkm, # use normalized expression as heatmap input
annCol = annCol, # sample annotation in heatmap
annColors = annColors, # colors for sample annotation
show_rownames = FALSE, # show no rownames (biomarker name)
fig.name = "UPREGULATED BIOMARKER HEATMAP")
# MUST locate ABSOLUTE path of msigdb file
MSIGDB.FILE <- system.file("extdata", "c5.bp.v7.1.symbols.xls", package = "MOVICS", mustWork = TRUE)
# run GSEA to identify up-regulated GO pathways using results from edgeR
gsea.up <- runGSEA(moic.res = cmoic.brca,
dea.method = "edger", # name of DEA method
prefix = "TCGA-BRCA", # MUST be the same of argument in runDEA()
dat.path = getwd(), # path of DEA files
res.path = getwd(), # path to save GSEA files
msigdb.path = MSIGDB.FILE, # MUST be the ABSOLUTE path of msigdb file
norm.expr = fpkm, # use normalized expression to calculate enrichment score
dirct = "up", # direction of dysregulation in pathway
p.cutoff = 0.05, # p cutoff to identify significant pathways
p.adj.cutoff = 0.25, # padj cutoff to identify significant pathways
gsva.method = "gsva", # method to calculate single sample enrichment score
norm.method = "mean", # normalization method to calculate subtype-specific enrichment score
fig.name = "UPREGULATED PATHWAY HEATMAP")
除了GSEA分析還能進行GSVA分析底扳。
# MUST locate ABSOLUTE path of gene set file
GSET.FILE <-
system.file("extdata", "gene sets of interest.gmt", package = "MOVICS", mustWork = TRUE)
# run NTP in Yau cohort by using up-regulated biomarkers
yau.ntp.pred <- runNTP(expr = brca.yau$mRNA.expr,
templates = marker.up$templates, # the template has been already prepared in runMarker()
scaleFlag = TRUE, # scale input data (by default)
centerFlag = TRUE, # center input data (by default)
doPlot = TRUE, # to generate heatmap
fig.name = "NTP HEATMAP FOR YAU")
# compare survival outcome in Yau cohort
surv.yau <- compSurv(moic.res = yau.ntp.pred,
surv.info = brca.yau$clin.info,
convt.time = "m", # switch to month
surv.median.line = "hv", # switch to both
fig.name = "KAPLAN-MEIER CURVE OF NTP FOR YAU")
# compare agreement in Yau cohort
agree.yau <- compAgree(moic.res = yau.ntp.pred,
subt2comp = brca.yau$clin.info[, "PAM50", drop = FALSE],
doPlot = TRUE,
fig.name = "YAU PREDICTEDMOIC WITH PAM50")
除了NTP之外,MOVICS還提供了另一種無模型方法來預測亞型贡耽。具體來說花盐,runPAM()首先在發(fā)現(xiàn)訓練集(即TCGA-BRCA)中對medoids (PAM)分類器進行分區(qū)訓練,以預測外部驗證測試集(即BRCA-Yau)中患者的亞型菇爪,驗證隊列中的每個樣本被分配到一個子類型標簽,其質(zhì)心與樣本的皮爾遜相關性最高柒昏。最后凳宙,將執(zhí)行組內(nèi)比例(IGP)統(tǒng)計來評估發(fā)現(xiàn)隊列和驗證隊列之間獲得的亞型的相似性和重現(xiàn)性。
yau.pam.pred <- runPAM(train.expr = fpkm,
moic.res = cmoic.brca,
test.expr = brca.yau$mRNA.expr)
# predict subtype in discovery cohort using NTP
tcga.ntp.pred <- runNTP(expr = fpkm,
templates = marker.up$templates,
doPlot = FALSE)
#> --original template has 500 biomarkers and 500 are matched in external expression profile.
#> cosine correlation distance
#> 643 samples; 5 classes; 100-100 features/class
#> serial processing; 1000 permutation(s)...
#> predicted samples/class (FDR<0.05)
#>
#> CS1 CS2 CS3 CS4 CS5 <NA>
#> 99 105 138 155 107 39
# predict subtype in discovery cohort using PAM
tcga.pam.pred <- runPAM(train.expr = fpkm,
moic.res = cmoic.brca,
test.expr = fpkm)
#> --all samples matched.
#> --a total of 13771 genes shared and used.
#> --log2 transformation done for training expression data.
#> --log2 transformation done for testing expression data.
# check consistency between current and NTP-predicted subtype in discovery TCGA-BRCA
runKappa(subt1 = cmoic.brca$clust.res$clust,
subt2 = tcga.ntp.pred$clust.res$clust,
subt1.lab = "CMOIC",
subt2.lab = "NTP",
fig.name = "CONSISTENCY HEATMAP FOR TCGA between CMOIC and NTP")
# check consistency between current and PAM-predicted subtype in discovery TCGA-BRCA
runKappa(subt1 = cmoic.brca$clust.res$clust,
subt2 = tcga.pam.pred$clust.res$clust,
subt1.lab = "CMOIC",
subt2.lab = "PAM",
fig.name = "CONSISTENCY HEATMAP FOR TCGA between CMOIC and PAM")
# check consistency between NTP and PAM-predicted subtype in validation Yau-BRCA
runKappa(subt1 = yau.ntp.pred$clust.res$clust,
subt2 = yau.pam.pred$clust.res$clust,
subt1.lab = "NTP",
subt2.lab = "PAM",
fig.name = "CONSISTENCY HEATMAP FOR YAU")
總結(jié)
截止到這篇推文职祷,有關MOVICS包的所有實操部分內(nèi)容都已結(jié)束氏涩,本來對這個R包的所有介紹內(nèi)容都應該完結(jié)了。
但是Immugent本著對粉絲極度負責任的態(tài)度有梆,后續(xù)專門會解讀一篇完全依靠這個R包來分析的SCI文章是尖,直到教會大家如何使用MOVICS包分析自己的數(shù)據(jù)為止,敬請期待泥耀!