轉(zhuǎn)錄因子(transcription factors, TFs)是指能夠識別棕洋、結(jié)合在某基因上游特異核苷酸序列上的蛋白質(zhì)微饥。其通過介導RNA聚合酶與DNA模板的結(jié)合逗扒,從而調(diào)控下游靶基因的轉(zhuǎn)錄;也可以和其它轉(zhuǎn)錄因子形成轉(zhuǎn)錄因子復合體來影響基因的轉(zhuǎn)錄欠橘。轉(zhuǎn)錄因子在生命活動過程中,參與調(diào)節(jié)基因組DNA開放性矩肩、募集RNA聚合酶進行轉(zhuǎn)錄過程、募集輔助因子調(diào)節(jié)特定的轉(zhuǎn)錄階段,調(diào)控生命體的免疫肃续、發(fā)育等進程黍檩。
轉(zhuǎn)錄因子特異識別叉袍、結(jié)合的DNA序列稱為轉(zhuǎn)錄因子結(jié)合位點(Transcription factor binding site,TFBS),長度在5~20bp范圍刽酱。由于同一個TF可以調(diào)控多個喳逛,而其具體結(jié)合到每個靶基因的TFBS不完全相同,但具有一定的保守性肛跌。
1.
SCENIC(Single-Cell Regulatory Network Inference and Clustering)分析是對ScRNA-Seq數(shù)據(jù)中轉(zhuǎn)錄因子(Transcription Factors艺配,TFs)進行研究,最終篩選得到調(diào)控強度顯著衍慎、處于核心作用的TFs转唉,從而為探究其發(fā)病機制奠定基礎(chǔ)。目前稳捆,SCENIC已改為pySCENIC
SCENIC流程包括三步驟:
(1)使用GENIE3或GRNBoost(Gradient Boosting)基于共表達赠法,推斷轉(zhuǎn)錄因子與候選靶基因之間的共表達模塊。
(2)由于GENIE3模型只是基于共表達乔夯,會存在很多假陽性和間接靶標砖织,為了識別直接結(jié)合靶標(direct-binding targets),使用RcisTarget對每個共表達模塊進行順式調(diào)控基序(cis-regulatory motif)分析末荐。進行TF-motif富集分析侧纯,識別直接靶標。(僅保留具有正確的上游調(diào)節(jié)子且顯著富集的motif modules甲脏,并對它們進行修剪以除去缺乏motif支持的間接靶標眶熬。)這些處理后的每個TF及其潛在的直接targets gene被稱作一個調(diào)節(jié)子(regulon);
(3)使用AUCell算法對每個細胞的每個regulon活性進行打分。對于一個regulon來說块请,比較細胞間的AUCell得分可以鑒定出哪種細胞有顯著更高的subnetwork活性娜氏。結(jié)果生成一個二進制的regulon活性矩陣(binarized activity matrix),這將確定Regulon在哪些細胞中處于"打開"狀態(tài)
1.1 下載所需參考數(shù)據(jù)集(linux端)
$ cd /data/shumin/data/RcisTarget
# (1)motif–gene注釋數(shù)據(jù)(genome ranking database): https://resources.aertslab.org/cistarget/databases/
wget https://resources.aertslab.org/cistarget/databases/homo_sapiens/hg38/refseq_r80/mc_v10_clust/gene_based/hg38_10kbp_up_10kbp_down_full_tx_v10_clust.genes_vs_motifs.rankings.feather &
wget https://resources.aertslab.org/cistarget/databases/homo_sapiens/hg38/refseq_r80/mc_v10_clust/gene_based/hg38_500bp_up_100bp_down_full_tx_v10_clust.genes_vs_motifs.rankings.feather &
# (2)motif–TF注釋數(shù)據(jù): https://resources.aertslab.org/cistarget/motif2tf/
wget https://resources.aertslab.org/cistarget/motif2tf/motifs-v10nr_clust-nr.hgnc-m0.001-o0.0.tbl
# (3)轉(zhuǎn)錄因子列表: https://resources.aertslab.org/cistarget/tf_lists/
wget https://resources.aertslab.org/cistarget/tf_lists/allTFs_hg38.txt
1.2 數(shù)據(jù)預處理
# 安裝Bioconductor和相關(guān)包
if (!requireNamespace("BiocManager", quietly = TRUE)) {
install.packages("BiocManager")
}
BiocManager::install(c("SCENIC", "GENIE3", "RcisTarget"))
# 安裝其他依賴包
install.packages(c("Matrix", "data.table", "igraph", "pheatmap", "ggplot2"))
# 加載所需的R包
library(SCENIC)
library(GENIE3)
library(RcisTarget)
library(Matrix)
library(data.table)
library(igraph)
library(pheatmap)
library(ggplot2)
# 加載表達數(shù)據(jù)矩陣(假設(shè)數(shù)據(jù)已經(jīng)是一個數(shù)據(jù)框)
load("sub_Epithelial_0903.Rdata")
expr <- sub_Epithelial@assays[["RNA"]]@counts
expr_matrix <- as.matrix(expr)
# 提取細胞信息/表型數(shù)據(jù)
cellInfo <- data.frame(sub_Epithelial@meta.data)
cellInfo <- cellInfo[,c("Sample","RNA_snn_res.0.6","Group","subtype")]
head(cellInfo)
## Sample RNA_snn_res.0.6 Group subtype
## AGGAGGTATGTAGTCTC N11 13 Normal mucous acinar cells
## TGCACGGATGGACCATA N11 5 Normal serous acinar cells
## CCCTTAGATGCACGTCT N11 6 Normal serous acinar cells
## GTGGGAACGATTGCAGT N11 5 Normal serous acinar cells
## ACTACGATACTGCAGCG N11 8 Normal ductal cells
## GACATCACGATACTCAG N11 6 Normal serous acinar cells
dir.create("int")
saveRDS(cellInfo, file="int/cellInfo.Rds")
# 設(shè)置SCENIC分析的參數(shù)
mydbDIR <- "/data/shumin/data/RcisTarget"
mydbs <- c("hg38__refseq-r80__500bp_up_and_100bp_down_tss.mc9nr.feather",
"hg38__refseq-r80__10kb_up_and_down_tss.mc9nr.feather")
names(mydbs) <- c("500bp", "10kb")
scenicOptions <- initializeScenic(org = "hgnc",
nCores = 15,
dbDir = mydbDIR,
dbs = mydbs,
datasetTitle = "pSS")
saveRDS(scenicOptions, "int/scenicOptions.Rds")
1.3 計算共表達網(wǎng)絡(luò)
SCENIC正式分析的第一步是計算轉(zhuǎn)錄因子與每個基因的相關(guān)性
## 基因過濾
# 過濾標準:基因表達量之和 > 細胞數(shù)*3%墩新,且在1%的細胞中表達
genesKept <- geneFiltering(expr_matrix, scenicOptions = scenicOptions,
minCountsPerGene = 3*.01*ncol(expr_matrix),
minSamples = ncol(expr_matrix)*.01)
## Maximum value in the expression matrix: 61516
## Ratio of detected vs non-detected: 0.034
## Number of counts (in the dataset units) per gene:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 1 27 6130 670 76233910
## Number of cells in which each gene is detected:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 1.0 25.0 918.5 563.0 25600.0
##
## Number of genes left after applying the following filters (sequential):
## 8444 genes with counts per gene > 836.88
## 8348 genes detected in more than 278.96 cells
## Using the column 'features' as feature index for the ranking database.
## 7226 genes available in RcisTarget database
## Gene list saved in int/1.1_genesKept.Rds
exprMat_filtered <- expr_matrix[genesKept, ]
dim(exprMat_filtered)
## [1] 7864 27896
# 計算相關(guān)性矩陣
runCorrelation(exprMat_filtered, scenicOptions)
# TF-Targets相關(guān)性回歸分析
exprMat_filtered_log <- log2(exprMat_filtered + 1)
data(list = "motifAnnotations_hgnc_v9", package = "RcisTarget")
motifAnnotations_hgnc <- motifAnnotations_hgnc_v9
runGenie3(exprMat_filtered_log, scenicOptions) # 這一步消耗的計算資源非常大贸弥,需要較長時間
## Using 689 TFs as potential regulators...
## Running GENIE3 part 1
## Running GENIE3 part 2
## Running GENIE3 part 3
## Running GENIE3 part 4
## Running GENIE3 part 5
## Running GENIE3 part 6
## Running GENIE3 part 7
## Running GENIE3 part 8
## Running GENIE3 part 9
## Running GENIE3 part 10
## Running GENIE3 part 11
## Running GENIE3 part 12
## Running GENIE3 part 13
## Running GENIE3 part 14
## Running GENIE3 part 15
## Running GENIE3 part 16
## Running GENIE3 part 17
## Running GENIE3 part 18
## Running GENIE3 part 19
## Running GENIE3 part 20
## Finished running GENIE3.
## Warning message:
## In runGenie3(exprMat_filtered, scenicOptions, nParts = 20) :
## Only 37% of the 1839 TFs in the database were found in the dataset. Do they use the same gene IDs?
以上代碼運行后,int目錄下生成相關(guān)不少中間結(jié)果:
? 1.2_corrMat.Rds:基因之間的相關(guān)性矩陣
? 1.3_GENIE3_weightMatrix_part_1.Rds等:GENIE3的中間結(jié)果
? 1.4_GENIE3_linkList.Rds:GENIE3最終結(jié)果海渊,是把“1.3_”開頭的文件合并在一起
1.4 推斷共表達模塊
接下來需要過濾低相關(guān)性的組合形成共表達基因集(模塊)绵疲。SCENIC提供了多種策略過濾低相關(guān)性TF-Target,建議是6種過濾標準都用
? w001:以每個TF為核心保留weight>0.001的基因形成共表達模塊切省;
? w005:以每個TF為核心保留weight>0.005的基因形成共表達模塊最岗;
? top50:以每個TF為核心保留weight值top50的基因形成共表達模塊;
? top5perTarget:每個基因保留weight值top5的TF得到精簡的TF-Target關(guān)聯(lián)表朝捆,然后把基因分配給TF構(gòu)建共表達模塊般渡;
? top10perTarget:每個基因保留weight值top10的TF得到精簡的TF-Target關(guān)聯(lián)表,然后把基因分配給TF構(gòu)建共表達模塊;
? top50perTarget:每個基因保留weight值top50的TF得到精簡的TF-Target關(guān)聯(lián)表驯用,然后把基因分配給TF構(gòu)建共表達模塊脸秽;
# 推斷共表達模塊(主要運行結(jié)果是int目錄下的1.6_tfModules_asDF.Rds)
scenicOptions <- runSCENIC_1_coexNetwork2modules(scenicOptions)
## 15:50 Creating TF modules
## 75% 90%
## 0.002407983 0.004071417
## Number of links between TFs and targets: 2551778
## [,1]
## nTFs 689
## nTargets 7864
## nGeneSets 2803
## nLinks 3275940
經(jīng)過上述分析每個轉(zhuǎn)錄因子都找到了強相關(guān)的靶基因,接下來過濾共表達模塊形成有生物學意義的調(diào)控單元(regulons):
? 對每個共表達模塊進行motif富集分析蝴乔,保留顯著富集的motif(此項分析依賴gene-motif評分(排行)數(shù)據(jù)庫记餐,其行為基因、列為motif薇正、值為排名片酝,就是下載的cisTarget數(shù)據(jù)庫)
? 使用數(shù)據(jù)庫對motif進行TF注釋,注釋結(jié)果分高挖腰、低可信度 雕沿。數(shù)據(jù)庫直接注釋和同源基因推斷的TF是高可信結(jié)果,使用motif序列相似性注釋的TF是低可信結(jié)果猴仑。
? 用保留的motif對共表達模塊內(nèi)的基因進行打分(同樣依據(jù)cisTarget數(shù)據(jù)庫)审轮,識別顯著高分的基因(理解為motif離這些基因的TSS很近);
? 刪除共表達模塊內(nèi)與motif評分不高的基因辽俗,剩下的基因集稱為調(diào)控單元(regulon)
# 推斷轉(zhuǎn)錄調(diào)控網(wǎng)絡(luò)(regulon)
scenicOptions <- runSCENIC_2_createRegulons(scenicOptions, coexMethod=c("top5perTarget")) #** Only for toy run!!
## 15:52 Step 2. Identifying regulons
## tfModulesSummary:
## top5perTarget
## 107
## 15:52 RcisTarget: Calculating AUC
## Using the column 'features' as feature index for the ranking database.
## Scoring database: [Source file: hg38__refseq-r80__500bp_up_and_100bp_down_tss.mc9nr.feather]
## Using the column 'features' as feature index for the ranking database.
## Scoring database: [Source file: hg38__refseq-r80__10kb_up_and_down_tss.mc9nr.feather]
## 15:53 RcisTarget: Adding motif annotation
## Using BiocParallel...
## Using BiocParallel...
## Number of motifs in the initial enrichment: 53899
## Number of motifs annotated to the matching TF: 1218
## 15:53 RcisTarget: Prunning targets
## Using the column 'features' as feature index for the ranking database.
## Using the column 'features' as feature index for the ranking database.
## 15:56 Number of motifs that support the regulons: 1218
## Preview of motif enrichment saved as: output/Step2_MotifEnrichment_preview.html
## Warning message:
## In RcisTarget::addLogo(tableSubset, motifCol = motifCol, dbVersion = dbVersion, :
## There is no annotation version attribute in the input table (it has probably been loaded with an older version of the package).'v9' will be used as it ## was the old default,but we recommend to re-load the annotations and/or re-run the enrichment to make sure everything is consistent.
★ 重要的結(jié)果保存在output文件夾:
1)Step2_MotifEnrichment.tsv:各個共表達模塊顯著富集motif的注釋信息
2)Step2_MotifEnrichment_preview.html
3)Step2_regulonTargetsInfo.tsv:最終的regulon文件(后續(xù)分析找到有價值的regulon疾渣,需要回到這個文件找對應(yīng)的轉(zhuǎn)錄因子和靶基因)
每個Regulon就是一個轉(zhuǎn)錄因子及其直接調(diào)控靶基因的基因集,接下來的工作就是對每個regulon在各個細胞中的活性評分崖飘。評分的基礎(chǔ)是基因的表達值榴捡,分數(shù)越高代表基因集的激活程度越高
# 計算細胞網(wǎng)絡(luò)評分
scenicOptions <- runSCENIC_3_scoreCells(scenicOptions, exprMat_filtered_log)
## 17:16 Step 3. Analyzing the network activity in each individual cell
## Number of regulons to evaluate on cells: 40
## Biggest (non-extended) regulons:
## XBP1 (267g)
## NFKB1 (96g)
## ELF1 (87g)
## FOSL1 (81g)
## FOS (79g)
## TEAD1 (61g)
## SPDEF (56g)
## CREB3L4 (37g)
## EGR1 (33g)
## IRF1 (31g)
## Quantiles for the number of genes detected by cell:
## (Non-detected genes are shuffled at the end of the ranking. Keep it in mind when choosing the threshold for calculating the AUC).
## min 1% 5% 10% 50% 100%
## 59 208 303 383 944 4139
## Using 15 cores.
## Warning in .AUCell_calcAUC(geneSets = geneSets, rankings = rankings, nCores = nCores, :
## Using only the first 208 genes (aucMaxRank) to calculate the AUC.
## Using 15 cores with doMC.
## 17:20 Finished running AUCell.
## 17:20 Plotting heatmap...
## 17:20 Plotting t-SNEs...
saveRDS(scenicOptions, file="int/scenicOptions.Rds") # To save status
★ runSCENIC_3_scoreCells()是一個多種功能打包的函數(shù),它包含的子功能有:
1)計算regulon在每個細胞中AUC值朱浴,最后得到一個以regulon為行細胞為列的矩陣薄疚。結(jié)果:int/3.4_regulonAUC.Rds
2)計算每個regulon的AUC閾值,細胞中regulonAUC值>閾值赊琳,代表此regulon在細胞中處于激活狀態(tài),否則代表沉默狀態(tài)砰碴。結(jié)果:int/3.5_AUCellThresholds.Rds
3)使用regulonAUC矩陣對細胞進行降維聚類
4)用heatmap圖展示regulonAUC矩陣躏筏,用t-SNE圖分別展示每個regulon的活性分布情況。結(jié)果:output/Step3_開頭的系列文件
二進制轉(zhuǎn)換及衍生分析(將regulonAUC矩陣轉(zhuǎn)換為二進制矩陣后呈枉,會重新降維聚類趁尼,輸出的結(jié)果與runSCENIC_3_scoreCells()類似)
scenicOptions <- runSCENIC_4_aucell_binarize(scenicOptions)
## Binary regulon activity: 28 TF regulons x 26877 cells.
## (40 regulons including 'extended' versions)
## 28 regulons are active in more than 1% (268.77) cells.
1.5 調(diào)整閾值(可選)
#使用shiny互動調(diào)整閾值
aucellApp <- plotTsne_AUCellApp(scenicOptions, exprMat_all)
savedSelections <- shiny::runApp(aucellApp)
#保存調(diào)整后的閾值
newThresholds <- savedSelections$thresholds
scenicOptions@fileNames$int["aucell_thresholds",1] <- "int/newThresholds.Rds"
saveRDS(newThresholds, file=getIntName(scenicOptions, "aucell_thresholds"))
saveRDS(scenicOptions, file="int/scenicOptions.Rds")
1.6 SCENIC結(jié)果可視化
runSCENIC_3_scoreCells()和runSCENIC_4_aucell_binarize()運行之后會生成一些可視化的heatmap圖與tSNE圖,但不美觀猖辫,不宜與seurat聯(lián)系起來酥泞,故可重新自行做圖
##導入原始regulonAUC矩陣
AUCmatrix <- readRDS("int/3.4_regulonAUC.Rds")
AUCmatrix <- AUCmatrix@assays@data@listData$AUC
AUCmatrix <- data.frame(t(AUCmatrix), check.names=F)
RegulonName_AUC <- colnames(AUCmatrix)
RegulonName_AUC <- gsub(' \\(','_',RegulonName_AUC)
RegulonName_AUC <- gsub('\\)','',RegulonName_AUC)
colnames(AUCmatrix) <- RegulonName_AUC
sub_Epithelialauc <- AddMetaData(sub_Epithelial, AUCmatrix)
sub_Epithelialauc@assays$integrated <- NULL
saveRDS(sub_Epithelialauc,'sub_Epithelialauc.rds')
##導入二進制regulonAUC矩陣
BINmatrix <- readRDS("int/4.1_binaryRegulonActivity.Rds")
BINmatrix <- data.frame(t(BINmatrix), check.names=F)
RegulonName_BIN <- colnames(BINmatrix)
RegulonName_BIN <- gsub(' \\(','_',RegulonName_BIN)
RegulonName_BIN <- gsub('\\)','',RegulonName_BIN)
colnames(BINmatrix) <- RegulonName_BIN
sub_Epithelialbin <- AddMetaData(sub_Epithelial, BINmatrix)
sub_Epithelialbin@assays$integrated <- NULL
saveRDS(sub_Epithelialbin, 'sub_Epithelialbin.rds')
##利用Seurat可視化AUC
dir.create('scenic_seurat')
#FeaturePlot
p1 = FeaturePlot(sub_Epithelialauc, features='CD59_extended_34g', label=T, reduction = 'umap')
p2 = FeaturePlot(sub_Epithelialbin, features='CD59_extended_34g', label=T, reduction = 'umap')
p3 = DimPlot(sub_Epithelial, reduction = 'umap', group.by = "subtype", label=T)
p1|p2|p3
# RidgePlot&VlnPlot
p1 = RidgePlot(sub_Epithelialauc, features = "CD59_extended_34g", group.by="subtype") +
theme(legend.position='none')
p2 = VlnPlot(sub_Epithelialauc, features = "CD59_extended_34g", pt.size = 0, group.by="subtype") +
theme(legend.position='none')
p1 + p2
library(pheatmap)
cellInfo <- readRDS("int/cellInfo.Rds")
subtype = subset(cellInfo,select = 'subtype')
AUCmatrix <- t(AUCmatrix)
BINmatrix <- t(BINmatrix)
#挑選部分感興趣的regulons
my.regulons <- c('CD59_extended_34g','CEBPD_extended_14g','CHD2_extended_668g','TEAD1_extended_68g','TEAD1_61g',
'JUNB_extended_24g','CREB3L4_37g','SPDEF_56g','CREB3L4_extended_124g','IRF1_31g','IRF1_extended_40g')
myAUCmatrix <- AUCmatrix[rownames(AUCmatrix)%in%my.regulons,]
myBINmatrix <- BINmatrix[rownames(BINmatrix)%in%my.regulons,]
#使用regulon原始AUC值繪制熱圖
pheatmap(myAUCmatrix, show_colnames=F, annotation_col=subtype)
#使用regulon二進制AUC值繪制熱圖
pheatmap(myBINmatrix, show_colnames=F, annotation_col=subtype,
color = colorRampPalette(colors = c("white","black"))(100))
參考教程:
- 單細胞轉(zhuǎn)錄組高級分析二:轉(zhuǎn)錄調(diào)控網(wǎng)絡(luò)分析(https://cloud.tencent.com/developer/article/1692240)