SCENIC全稱:Single-Cell rEgulatory Network Inference and Clustering
實戰(zhàn)之前需要準(zhǔn)備的學(xué)習(xí)資料:
1.閱讀關(guān)于SCENIC的介紹:SCENIC | 以single-cell RNA-seq數(shù)據(jù)推斷基因調(diào)控網(wǎng)絡(luò)和細(xì)胞功能聚類
2.閱讀文獻(xiàn):SCENIC: single-cell regulatory network inference and clustering [PMID: 28991892]
?以上兩部分介紹SCENIC能做什么與SCENIC標(biāo)準(zhǔn)流程
3.下載官方教程:aertslab/SCENIC
?下載的文件中重點看兩部分文件SCENIC_Running.html與detailedStep_0-4.html
? - SCENIC_Running.html里有每一步的說明與標(biāo)準(zhǔn)流程代碼
? - detailedStep_0-4.html里有標(biāo)準(zhǔn)流程代碼中包裝好的函數(shù)中的詳細(xì)代碼,主要是為了有個性化用戶使用時修改代碼莺戒。入門實戰(zhàn)也需要看伴嗡,主要是為了進(jìn)一步了解標(biāo)準(zhǔn)流程中使用的函數(shù)具體做了什么,明白每一步產(chǎn)生的數(shù)據(jù)是什么从铲,數(shù)據(jù)的名稱瘪校,存儲在了那一文件夾中,以及執(zhí)行一個函數(shù)時調(diào)用了哪一個數(shù)據(jù)。
標(biāo)準(zhǔn)流程如下阱扬,基于3個R包:GENIE3泣懊,RcisTarget,AUCell:
? ? ? ? ? ? ? ? ? ? 操練起來
1.在R中加載表達(dá)矩陣與樣品(細(xì)胞)信息
rm(list = ls())
options(stringsAsFactors = F)
#Loading data
load(file = 'input_data.Rdata')
#Filter the samples based on the standatd in the article.
table(meta$Mapping.rate>30)
meta_filter1<-meta[meta$Mapping.rate>30,]
meta_filter2<-meta_filter1[meta_filter1$Gene.number>1000,]
meta_filter3<-meta_filter2[meta_filter2$Transcript.number>=10000,]
dim(meta_filter3)
#Choose the keeped the samples in the expression matrix.
keep.samples<-as.character(meta_filter3[,1])
counts_filter<-counts[,keep.samples]
dim(counts_filter)
counts_filter[1:4,1:10]
#Check the expression matrix
fivenum(apply(counts_filter,1,function(x) sum(x>0)))
hist(apply(counts_filter,2,function(x) sum(x>0)),breaks = 100)
#Transform the counts number to CPM in counts_filter
library(edgeR)
counts_filter<-cpm(counts_filter)
載入的表達(dá)矩陣counts_filter為基因表達(dá)的count值麻惶,我們這里只通過使用edgeR包中的CPM()函數(shù)去除了文庫大小馍刮,未做log處理。原因見官網(wǎng)描述(如下):
Expression units: The preferred expression values are gene-summarized counts. There is currently not a strong recommendation towards using the raw counts, or counts normalized through single-cell specific methods (e.g. Seurat). Other measurements, such as transcripts/counts per million (TPM) and FPKM/RPKM, are also accepted as input. However, note that some authors recommend avoiding within sample normalization (i.e. TPM) for co-expression analysis (first step of SCENIC) because they may induce artificial co-variation (Crow et al. (2016)). The choice of input expression matrix might have some effect on the co-expression analysis to create the regulons (step 1). The other steps of the workflow are not directly affected by the input expression values: (2) The expression is not taken into account for the motif analysis, and (3) AUCell, which is used for scoring the regulons on the cells, is cell ranking-based (it works as an implicit normalization). Overall, SCENIC is quite robust to this choice, we have applied SCENIC to datasets using raw (logged) UMI counts, normalized UMI counts, and TPM and they all provided reliable results (see Aibar et al. (2017)).
2. 使用Seurat包創(chuàng)建對象
#Create subject using Seurat R package
suppressMessages(library(Seurat))
Pollen <- CreateSeuratObject(counts = counts_filter,
meta.data =meta_filter3,
min.cells = 300,
min.features = 1500,
project = "Pollen")
Pollen
sce <- Pollen
Seurat包用法與創(chuàng)建對象時參數(shù)的設(shè)置詳見生信技能樹B站教程:全網(wǎng)第一的單細(xì)胞轉(zhuǎn)錄組實戰(zhàn)演練
3. 提取sce對象中的表達(dá)矩陣與樣品(cell)信息并保存
exprMat<-GetAssayData(object = sce,slot = "counts")
class(exprMat)
dim(exprMat)
exprMat[1:4,1:10]
cellInfo <-sce@meta.data
dim(cellInfo)
save(exprMat,cellInfo,file = 'Mat and cell.Rdata')
4.下載RcisTarget的物種特定數(shù)據(jù)庫(這里是第一個坑)
?RcisTarget數(shù)據(jù)庫中的數(shù)據(jù)目前只支持人用踩,鼠渠退,果蠅三個物種。實戰(zhàn)用的數(shù)據(jù)是人膝關(guān)節(jié)軟骨細(xì)胞脐彩,因此下載了人的feather文件碎乃。為什么說這里是個坑呢?官網(wǎng)給的下載方式是這樣的:
dbFiles <- c("https://resources.aertslab.org/cistarget/databases/homo_sapiens/hg19/refseq_r45/mc9nr/gene_based/hg19-500bp-upstream-7species.mc9nr.feather",
"https://resources.aertslab.org/cistarget/databases/homo_sapiens/hg19/refseq_r45/mc9nr/gene_based/hg19-tss-centered-10kb-7species.mc9nr.feather")
# dir.create("cisTarget_databases"); setwd("cisTarget_databases") # if needed
for(featherURL in dbFiles)
{
download.file(featherURL, destfile=basename(featherURL)) # saved in current dir
}
?在大陸下載這個數(shù)據(jù)的速度真的很感人(也可能是我家網(wǎng)不行惠奸,但是玩LOL剛剛的)梅誓,容易下一半直接掛掉。直接掛掉還是好的佛南,如果你下載下來了梗掰,文件大小也與官網(wǎng)給的一樣,但其實里面的數(shù)據(jù)是不對的嗅回。后續(xù)分析調(diào)用這個文件的時候R會報錯及穗。我第一次使用的時候就是這樣。
?解決的方案是:手動下載(無奈辦了一個迅雷會員绵载,1.02G的feather文件分分鐘下好)并使用軟件MD5 & SHA Checksum Utility檢查文件的完整性埂陆,與官網(wǎng)給的sha256sum進(jìn)行比較,sha256sum一致娃豹,即可用于后續(xù)分析焚虱。MD5 & SHA Checksum Utility的用法自行百度。
5. SCENIC的初始設(shè)置
library(SCENIC)
org <- "hgnc"
dbDir <- "cisTarget_databases"
myDatasetTitle <- "SCENIC analysis on Human cartilage" # Create a name for your analysis
data(defaultDbNames)
dbs <- defaultDbNames[[org]]
scenicOptions <- initializeScenic(org=org, dbDir=dbDir, dbs=dbs, datasetTitle=myDatasetTitle, nCores=10)
#scenicOptions <- readRDS("int/scenicOptions.Rds")
saveRDS(scenicOptions, file="int/scenicOptions.Rds")
這里SCENIC創(chuàng)建了一個對象scenicOptions懂版,后續(xù)分析都是基于這個對象來進(jìn)行的鹃栽。
此外,在后續(xù)分析中產(chǎn)生的數(shù)據(jù)會存儲在R當(dāng)前目錄下的int文件夾躯畴,SCENIC輸出的結(jié)果會存儲在R當(dāng)前目錄下的output文件夾民鼓。
6. 將表達(dá)矩陣中未在cisTarget database中收錄的基因去除
library(RcisTarget)
dbFilePath <- getDatabases(scenicOptions)[[1]]
motifRankings <- importRankings(dbFilePath)
genesInDatabase <- colnames(getRanking(motifRankings))
genesLeft_minCells<-rownames(exprMat)
length(genesLeft_minCells)
genesLeft_minCells_inDatabases <- genesLeft_minCells[which(genesLeft_minCells %in% genesInDatabase)]
length(genesLeft_minCells_inDatabases)
genesKept <- genesLeft_minCells_inDatabases
exprMat_filtered <- exprMat[genesKept, ]
dim(exprMat_filtered)
7. 計算相關(guān)性并對數(shù)據(jù)進(jìn)行l(wèi)og處理
class(exprMat_filtered)
runCorrelation(as.matrix(exprMat_filtered), scenicOptions)
exprMat_filtered <- log2(exprMat_filtered+1)
8. 運行GENIE3
library(GENIE3)
runGenie3(as.matrix(exprMat_filtered), scenicOptions)
save(exprMat_filtered,scenicOptions,file = "input_GENIE3_data.Rdata")
官網(wǎng)教程中這里沒有as.matrix()函數(shù),我在運行的時候報錯了私股,class()以后顯示exprMat_filtered是一個matrix,不知道為什么會報錯摹察,但是加入as.matrix()函數(shù)就可以了。
9.建立GRN并對其打分
?為了節(jié)約電腦內(nèi)存倡鲸,這里重啟了R并重新載入數(shù)據(jù)
load('Mat and cell.Rdata')
logMat <- log2(exprMat+1)
dim(exprMat)
?運行SCENIC
library(SCENIC)
scenicOptions <- readRDS("int/scenicOptions.Rds")
scenicOptions@settings$verbose <- TRUE
scenicOptions@settings$nCores <- 1
scenicOptions@settings$seed <- 123
scenicOptions@settings$dbs <- scenicOptions@settings$dbs["10kb"]
runSCENIC_1_coexNetwork2modules(scenicOptions)
runSCENIC_2_createRegulons(scenicOptions, coexMethod=c("top5perTarget"))
runSCENIC_3_scoreCells(scenicOptions, as.matrix(logMat))
#電腦內(nèi)存只有16G,以上步驟相當(dāng)耗時黄娘,可以睡前運行峭状,醒來結(jié)果就出來了=克滴。=!
newThresholds <- savedSelections$thresholds
scenicOptions@fileNames$int["aucell_thresholds",1] <- "int/newThresholds.Rds"
saveRDS(newThresholds, file=getIntName(scenicOptions, "aucell_thresholds"))
saveRDS(scenicOptions, file="int/scenicOptions.Rds")
#網(wǎng)絡(luò)活性的結(jié)果進(jìn)行二維化
runSCENIC_4_aucell_binarize(scenicOptions)
?結(jié)果可視化
aucellApp <- plotTsne_AUCellApp(scenicOptions, logMat)
savedSelections <- shiny::runApp(aucellApp)
左上角可以調(diào)不同轉(zhuǎn)錄因子,觀察每個regulon的可視化結(jié)果优床。
計算得到的regulon(轉(zhuǎn)錄因子與其預(yù)測靶基因)結(jié)果存儲在output文件夾中:Step2_regulonTargetsInfo
到這里SCENIC的主干流程即已完成劝赔,后續(xù)對結(jié)果的解讀與其他可視化見官網(wǎng)教程。