MethylMix: an R package for identifying DNA methylation-driven genes.
本文參考:識(shí)別甲基化驅(qū)動(dòng)的癌癥基因
GetData函數(shù)栋猖,怎么爬取數(shù)據(jù):https://www.bioconductor.org/developers/how-to/web-query/
參考download.file函數(shù):https://github.com/MSQ-123/MethylMix/blob/master/R/Download_Preprocess.R
這個(gè)包是用純R寫(xiě)的,可以參考一下其結(jié)構(gòu):怎么寫(xiě)函數(shù)爬取TCGA的data
background
這個(gè)包的功能是鑒定潛在的甲基化驅(qū)動(dòng)的癌基因蟀拷。也就是說(shuō),如果一個(gè)基因在癌癥樣本中相比正常樣本低表達(dá)峭跳,且這個(gè)基因在癌癥樣本中被高度甲基化,那么我們認(rèn)為這個(gè)基因就是一個(gè)driver gene
driver gene 的差異甲基化值(DM-value)就是該基因的差異甲基化程度,對(duì)于混合的多個(gè)相同癌癥類(lèi)型的樣本,還可以根據(jù)它們的DM-value袍啡,對(duì)該癌癥類(lèi)型進(jìn)行分亞型的操作。
Run
第一步獲取數(shù)據(jù):包提供了獲取數(shù)據(jù)的函數(shù)却桶,無(wú)須自己下載境输,但速度感人:
Step 1: Automated Downloading from TCGA: DNA methylation datasets and Gene expression Datasets are downloaded automatically from TCGA by supplying the TCGA cancer code. We have provided the functionality to study any of the 33 TCGA cancer sites that are currently available.
以卵巢癌為例:
library(doParallel)
cancerSite <- "OV"
targetDirectory <- paste0(getwd(), "/")
cl <- makeCluster(5)
registerDoParallel(cl)
# Downloading methylation data
METdirectories <- Download_DNAmethylation(cancerSite, targetDirectory)
# Processing methylation data
METProcessedData <- Preprocess_DNAmethylation(cancerSite, METdirectories)
# Saving methylation processed data
saveRDS(METProcessedData, file = paste0(targetDirectory, "MET_", cancerSite, "_Processed.rds"))
# Downloading gene expression data
GEdirectories <- Download_GeneExpression(cancerSite, targetDirectory)
# Processing gene expression data
GEProcessedData <- Preprocess_GeneExpression(cancerSite, GEdirectories)
# Saving gene expression processed data
saveRDS(GEProcessedData, file = paste0(targetDirectory, "GE_", cancerSite, "_Processed.rds"))
# Clustering probes to genes methylation data
METProcessedData <- readRDS(paste0(targetDirectory, "MET_", cancerSite, "_Processed.rds"))
res <- ClusterProbes(METProcessedData[[1]], METProcessedData[[2]])
# Putting everything together in one file
toSave <- list(METcancer = res[[1]], METnormal = res[[2]], GEcancer = GEProcessedData[[1]],
GEnormal = GEProcessedData[[2]], ProbeMapping = res$ProbeMapping)
saveRDS(toSave, file = paste0(targetDirectory, "data_", cancerSite, ".rds"))
stopCluster(cl)
函數(shù)會(huì)自動(dòng)捕獲27k or 450k的甲基化芯片數(shù)據(jù),并進(jìn)行適當(dāng)?shù)暮喜⒂毕担籇ownload函數(shù)下載的數(shù)據(jù)不能直接拿來(lái)分析嗅剖,需要預(yù)處理:預(yù)處理包括校正批次效應(yīng),缺省值模擬嘁扼,CpG探針檢測(cè)位點(diǎn)clustering信粮,缺省值過(guò)多的數(shù)據(jù)直接過(guò)濾等操作,出來(lái)以后直接是matrix對(duì)象趁啸,可以直接用來(lái)下游分析强缘,很方便。
這里我們用包的內(nèi)置數(shù)據(jù)運(yùn)行一下:
#examples
data(METcancer)
data(METnormal)
data(GEcancer)
head(METcancer[, 1:4])
head(GEcancer)
基本需要的input就是METcancer(癌癥樣本的甲基化數(shù)據(jù))莲绰、METnormal(正常樣本的甲基化數(shù)據(jù))以及GEcancer(癌癥樣本的基因表達(dá)量數(shù)據(jù))欺旧。核心函數(shù)是MethylMix:
MethylMixResults <- MethylMix(METcancer, GEcancer, METnormal)
## Found 251 samples with both methylation and expression data.
## Correlating methylation data with gene expression...
##
## Found 9 transcriptionally predictive genes.
##
## Starting Beta mixture modeling.
## Running Beta mixture model on 9 genes and on 251 samples.
## ERBB2 : 2 components are best.
## FAAH : 2 components are best.
## FOXD1 : 2 components are best.
## ME1 : 2 components are best.
## MGMT : 2 components are best.
## OAS1 : 2 components are best.
## SOX10 : 2 components are best.
## TRAF6 : 2 components are best.
## ZNF217 : 2 components are best.
解釋一下Results里面每個(gè)參數(shù)的含義:
-
MethylationDrivers
: Driver genes,利用beta分布模擬出來(lái)的驅(qū)動(dòng)基因蛤签。 -
NrComponents
: The number of methylation states found for each driver gene. 正常來(lái)說(shuō)辞友,methylation states有兩個(gè):hypo和hyper -
MixtureStates
: 每個(gè)驅(qū)動(dòng)基因的DM-values,應(yīng)該是取平均的值 -
MethylationStates
: 一個(gè)矩陣震肮,每個(gè)驅(qū)動(dòng)基因在每個(gè)樣本中的DM-values -
Classifications
: 對(duì)于每個(gè)樣本每個(gè)基因的methylation states称龙,值為1對(duì)應(yīng)hypo,值為2對(duì)應(yīng)hyper -
Models
: Beta分布模型的參數(shù)
獲得的Results可以進(jìn)行可視化:選擇需要可視化的基因名戳晌,
# Plot the most famous methylated gene for glioblastoma
plots <- MethylMix_PlotModel("MGMT", MethylMixResults, METcancer)
plots$MixtureModelPlot
# Plot MGMT also with its normal methylation variation
plots <- MethylMix_PlotModel("MGMT", MethylMixResults, METcancer, METnormal = METnormal)
plots$MixtureModelPlot
# Plot a MethylMix model for another gene
plots <- MethylMix_PlotModel("ZNF217", MethylMixResults, METcancer, METnormal = METnormal)
plots$MixtureModelPlot
同樣地鲫尊,我們可以結(jié)合表達(dá)量數(shù)據(jù),探究基因的表達(dá)量和甲基化水平的負(fù)相關(guān)關(guān)系:
# Also plot the inverse correlation with gene expression (creates two separate
# plots)
plots <- MethylMix_PlotModel("MGMT", MethylMixResults, METcancer, GEcancer, METnormal)
plots$MixtureModelPlot
plots$CorrelationPlot
# Plot all functional and differential genes
for (gene in MethylMixResults$MethylationDrivers) {
MethylMix_PlotModel(gene, MethylMixResults, METcancer, METnormal = METnormal)
}