AUC允許在單細胞rna數(shù)據(jù)集中識別具有活性基因集(如gene signature,gene module)的細胞。AUCell使用曲線下面積來計算輸入基因集的一個關(guān)鍵子集是否在每個細胞的表達基因中富集。AUCell和GSVA的算法是一回事蹋岩,都是排序蛔糯。AUCell借鑒了ssGSVA的算法这揣,但是在排序的時候攻柠,它們的有些處理不太一樣赃磨。在SCENIC 轉(zhuǎn)錄因子預(yù)測的分析過程中立由,已經(jīng)封裝了AUCell轧钓。
AUCell的工作流程分為三步,分別通過三個函數(shù)完成:
1. AUCell_buildRankings
:Build the rank
2. AUCell_calcAUC
:Calculate the Area Under the Curve (AUC)
3. AUCell_exploreThresholds
:Set the assignment thresholds
1. 數(shù)據(jù)準備
- 1.1 R包下載
if (!requireNamespace("BiocManager", quietly=TRUE))
install.packages("BiocManager")
# To support paralell execution:
BiocManager::install(c("doMC", "doRNG","doSNOW"))
# For the main example:
BiocManager::install(c("mixtools", "GEOquery", "SummarizedExperiment"))
# For the examples in the follow-up section of the tutorial:
BiocManager::install(c("DT", "plotly", "NMF", "d3heatmap", "shiny", "rbokeh",
"dynamicTreeCut","R2HTML","Rtsne", "zoo"))
browseVignettes("AUCell")
- 1.2 數(shù)據(jù)準備:AUCell所需要的輸入數(shù)據(jù)分為兩部分锐膜,
矩陣數(shù)據(jù)
和基因集數(shù)據(jù)
聋迎。
#設(shè)置工作目錄
dir.create("AUCell_tutorial")
setwd("AUCell_tutorial") # or in the first code chunk (kntr options), if running as Notebook
下載演示數(shù)據(jù)
library(GEOquery)
attemptsLeft <- 20
while(attemptsLeft>0)
{
geoFile <- tryCatch(getGEOSuppFiles("GSE60361", makeDirectory=FALSE), error=identity)
if(methods::is(geoFile,"error"))
{
attemptsLeft <- attemptsLeft-1
Sys.sleep(5)
}
else
attemptsLeft <- 0
}
gzFile <- grep(".txt.gz", basename(rownames(geoFile)), fixed=TRUE, value=TRUE)
message(gzFile)
#GSE60361_C1-3005-Expression.txt.gz
txtFile <- gsub(".gz", "", gzFile, fixed=TRUE)
message(txtFile)
#GSE60361_C1-3005-Expression.txt
gunzip(filename=gzFile, destname=txtFile, remove=TRUE)
library(data.table)
geoData <- fread(txtFile, sep="\t")
geneNames <- unname(unlist(geoData[,1, with=FALSE]))
exprMatrix <- as.matrix(geoData[,-1, with=FALSE])
rm(geoData)
dim(exprMatrix)
# [1] 19972 3005
rownames(exprMatrix) <- geneNames
exprMatrix[1:5,1:4]
# 1772071015_C02 1772071017_G12 1772071017_A05 1772071014_B06
# Tspan12 0 0 0 3
# Tshz1 3 1 0 2
# Fnbp1l 3 1 6 4
# Adamts15 0 0 0 0
# Cldn12 1 1 1 0
# Remove file
file.remove(txtFile)
# [1] TRUE
# Save for future use
mouseBrainExprMatrix <- exprMatrix
save(mouseBrainExprMatrix, file="exprMatrix_AUCellVignette_MouseBrain.RData")
# 僅使用數(shù)據(jù)集中的5000個基因做演示
# load("exprMatrix_AUCellVignette_MouseBrain.RData")
set.seed(333)
exprMatrix <- mouseBrainExprMatrix[sample(rownames(mouseBrainExprMatrix), 5000),]
dim(exprMatrix)
# [1] 5000 3005
準備輸入基因集
library(AUCell)
library(GSEABase)
gmtFile <- paste(file.path(system.file('examples', package='AUCell')), "geneSignatures.gmt", sep="/")
geneSets <- getGmt(gmtFile)
# check how many of these genes are in the expression matrix:
geneSets <- subsetGeneSets(geneSets, rownames(exprMatrix))
cbind(nGenes(geneSets)) #演示基因集中有6個基因集 后面的數(shù)字是基因集中對應(yīng)的基因數(shù)目
## [,1]
## Astrocyte_Cahoy 538
## Neuron_Cahoy 384
## Oligodendrocyte_Cahoy 477
## Astrocyte_Lein 7
## Neuron_Lein 13
## Microglia_lavin 165
# 為了方便演示,將基因集中的基因數(shù)目添加到基因集名字中
geneSets <- setGeneSetNames(geneSets, newNames=paste(names(geneSets), " (", nGenes(geneSets) ,"g)", sep=""))
# 添加一些隨機抽樣出的基因組成的的基因集和在多數(shù)細胞中表達的基因(如house-keeping like gene)枣耀。(僅供演示霉晕,實操不必這樣做)
# Random
set.seed(321)
extraGeneSets <- c(
GeneSet(sample(rownames(exprMatrix), 50), setName="Random (50g)"),
GeneSet(sample(rownames(exprMatrix), 500), setName="Random (500g)"))
countsPerGene <- apply(exprMatrix, 1, function(x) sum(x>0))
# Housekeeping-like
extraGeneSets <- c(extraGeneSets,
GeneSet(sample(names(countsPerGene)[which(countsPerGene>quantile(countsPerGene, probs=.95))], 100), setName="HK-like (100g)"))
geneSets <- GeneSetCollection(c(geneSets,extraGeneSets))
names(geneSets)
## [1] "Astrocyte_Cahoy (538g)" "Neuron_Cahoy (384g)"
## [3] "Oligodendrocyte_Cahoy (477g)" "Astrocyte_Lein (7g)"
## [5] "Neuron_Lein (13g)" "Microglia_lavin (165g)"
## [7] "Random (50g)" "Random (500g)"
## [9] "HK-like (100g)"
2. Build gene-expression rankings for each cell (AUCell_buildRankings
函數(shù))
輸入的是表達矩陣
cells_rankings <- AUCell_buildRankings(exprMatrix, nCores=1, plotStats=TRUE)
# Quantiles for the number of genes detected by cell:
# (Non-detected genes are shuffled at the end of the ranking. Keep it in mind when choosing the threshold for calculating the AUC).
# min 1% 5% 10% 50% 100%
# 191 270 369 445 929 2062
查看排列的結(jié)果
cells_rankings
# Ranking for 5000 genes (rows) and 3005 cells (columns).
# Top-left corner of the ranking:
# cells
# genes 1772071015_C02 1772071017_G12 1772071017_A05 1772071014_B06 1772067065_H06
# 0610009B22Rik 3417 859 672 648 1822
# 0610009L18Rik 4196 4453 2603 3537 4642
# 0610010B08Rik_loc6 1497 1645 4543 2843 1646
# 0610011F06Rik 355 861 1227 2095 847
# 0610012H03Rik 2331 4084 2533 3098 4702
# 0610039K10Rik 3390 2443 4490 2202 4239
# Quantiles for the number of genes detected by cell:
# min 1% 5% 10% 50% 100%
# 191 270 369 445 929 2062
這一步的排名是對每個基因從高到低進行排名,也就是對每一個細胞的每一個基因進行排名捞奕,得到一個排名的矩陣牺堰。
3. Calculate the Area Under the Curve (AUC)(AUCell_calcAUC
函數(shù))
cells_AUC <- AUCell_calcAUC(geneSets, cells_rankings)
save(cells_AUC, file="cells_AUC.RData")
cells_AUC
# AUC for 9 gene sets (rows) and 3005 cells (columns).
# Top-left corner of the AUC matrix:
# cells
# gene sets 1772071015_C02 1772071017_G12 1772071017_A05 1772071014_B06 1772067065_H06
# Astrocyte_Cahoy 0.14705221 0.13297992 0.15524498 0.16514056 0.14599197
# Neuron_Cahoy 0.27110040 0.29031325 0.29326908 0.31334940 0.32883534
# Oligodendrocyte_Cahoy 0.15183936 0.13073092 0.13898795 0.12356627 0.13783133
# Astrocyte_Lein 0.15389082 0.13995354 0.21777003 0.24332172 0.10046458
# Neuron_Lein 0.37828427 0.41278886 0.50743906 0.46438746 0.45868946
# Microglia_lavin 0.09373979 0.06140446 0.06481582 0.08372346 0.07410633
根據(jù)輸入基因list中的基因在每個細胞中的表達量,對每個細胞都算了一個該genelist的AUC值颅围。
因為我們是直接輸入的矩陣伟葫,如果矩陣是來自于單細胞數(shù)據(jù),這時對于每個輸入的list院促,每個細胞都已經(jīng)有一個auc值筏养,可以將auc值投射回umap或tsne圖(見6.實操)。
4. Determine the cells with the given gene signatures or active gene sets (AUCell_exploreThresholds
函數(shù))
AUC估計了輸入基因集中的基因在每個細胞高表達的比例常拓。表達了輸入基因集中較多基因的細胞具有較高的AUC值渐溶。由于AUC值可以代表細胞表達基因集中基因的比例,我們可以使用跨細胞的相對AUCs來查看單細胞轉(zhuǎn)錄組的細胞對輸入基因集的響應(yīng)情況弄抬。
由于AUC并不是一個絕對的數(shù)值茎辐,而是受到細胞類型、數(shù)據(jù)集、基因集等的影響拖陆,尤其是在單細胞水平弛槐,大多數(shù)基因并不在所有細胞中穩(wěn)定表達。因此有時判斷某個基因集的gene在一個細胞中是'on'還是'off'狀態(tài)比較困難依啰。最理想的方法是雙峰分布bi-modal distribution乎串,也就是數(shù)據(jù)集中部分細胞有較高AUC而大多數(shù)的細胞有較低的AUC。AUCell提供了AUCell_exploreThresholds()
函數(shù)來區(qū)分較高和較低的AUC值速警,也就是自動劃定雙峰分布的閾值灌闺,判斷基因集的'on'或'off'狀態(tài)。
set.seed(123)
par(mfrow=c(3,3))
cells_assignment <- AUCell_exploreThresholds(cells_AUC, plotHist=TRUE, assign=TRUE)
getAUC(cells_AUC)[,1:5]
# cells
# gene sets 1772071015_C02 1772071017_G12 # 1772071017_A05 1772071014_B06 1772067065_H06
# Astrocyte_Cahoy (538g) 0.14621687 0.132208835 0.15315663 0.163823293 0.147855422
# Neuron_Cahoy (384g) 0.27293173 0.291887550 0.29529317 0.312481928 0.328899598
# Oligodendrocyte_Cahoy (477g) 0.15752610 0.130120482 0.13937349 0.124144578 0.134650602
# Astrocyte_Lein (7g) 0.14982578 0.145180023 0.21544715 0.242160279 0.103368177
# Neuron_Lein (13g) 0.38524850 0.404558405 0.51187085 0.467553023 0.446660336
# Microglia_lavin (165g) 0.09196153 0.062928688 0.06546906 0.082852477 0.074578116
# Random (50g) 0.01273942 0.008908686 0.01336303 0.008819599 0.003296214
# Random (500g) 0.10306827 0.102522088 0.11167871 0.113959839 0.067726908
# HK-like (100g) 0.40626566 0.427769424 0.41182957 0.419548872 0.403208020
warningMsg <- sapply(cells_assignment, function(x) x$aucThr$comment)
warningMsg[which(warningMsg!="")]
# Random (500g)
# "The AUC might follow a normal distribution (random gene-set?). The global distribution overlaps the partial distributions. "
AUCell_exploreThresholds函數(shù)計算的每個基因集的閾值儲存在$aucThr
坏瞄。如查看oligodendrocyte基因集的閾值:
cells_assignment$Oligodendrocyte_Cahoy$aucThr$thresholds
## threshold nCells
## Global_k1 0.2419241 739
## L_k2 0.2437017 733
## R_k3 0.1766730 1008
## minimumDens 0.2417192 740
cells_assignment$Oligodendrocyte_Cahoy$aucThr$selected
## minimumDens
## 0.2417192
查看在這個閾值下處于on狀態(tài)的細胞數(shù)目
oligodencrocytesAssigned <- cells_assignment$Oligodendrocyte_Cahoy$assignment
length(oligodencrocytesAssigned)
# [1] 740
head(oligodencrocytesAssigned)
## [1] "1772067069_B02" "1772066077_G03"
## [3] "1772066070_D08" "1772067076_A07"
## [5] "1772067057_G07" "1772067076_C07"
也可以手動對閾值進行調(diào)整
繪制特定基因集的條型圖桂对,設(shè)定新閾值
geneSetName <- rownames(cells_AUC)[grep("Oligodendrocyte_Cahoy", rownames(cells_AUC))]
AUCell_plotHist(cells_AUC[geneSetName,], aucThr=0.25)
abline(v=0.25)
使用新閾值分配細胞
newSelectedCells <- names(which(getAUC(cells_AUC)[geneSetName,]>0.08))
length(newSelectedCells)
# [1] 3002
head(newSelectedCells)
# [1] "1772071015_C02" "1772071017_G12" "1772071017_A05" "1772071014_B06"
# [5] "1772067065_H06" "1772071017_E02"
5. 后續(xù)分析
加載3005個細胞的tsne坐標并繪圖
load(paste(file.path(system.file('examples', package='AUCell')), "cellsTsne.RData", sep="/"))
cellsTsne <- cellsTsne$Y
plot(cellsTsne, pch=16, cex=.3)
tsne數(shù)據(jù)的來源:
load("exprMatrix_AUCellVignette_MouseBrain.RData")
sumByGene <- apply(mouseBrainExprMatrix, 1, sum)
exprMatSubset <- mouseBrainExprMatrix[which(sumByGene>0),]
logMatrix <- log2(exprMatSubset+1)
library(Rtsne)
set.seed(123)
cellsTsne <- Rtsne(t(logMatrix))
rownames(cellsTsne$Y) <- colnames(logMatrix)
colnames(cellsTsne$Y) <- c("tsne1", "tsne2")
save(cellsTsne, file="cellsTsne.RData")
上面的tSNE圖可以通過AUC評分來上色。為了更好的區(qū)分細胞鸠匀,我們將閾值內(nèi)的細胞標為pink-red蕉斜,其他細胞標為black-blue。
selectedThresholds <- getThresholdSelected(cells_assignment)
par(mfrow=c(2,4)) # Splits the plot into two rows and three columns
for(geneSetName in names(selectedThresholds))
{
nBreaks <- 5 # Number of levels in the color palettes
# Color palette for the cells that do not pass the threshold
colorPal_Neg <- grDevices::colorRampPalette(c("black","blue", "skyblue"))(nBreaks)
# Color palette for the cells that pass the threshold
colorPal_Pos <- grDevices::colorRampPalette(c("pink", "magenta", "red"))(nBreaks)
# Split cells according to their AUC value for the gene set
passThreshold <- getAUC(cells_AUC)[geneSetName,] > selectedThresholds[geneSetName]
if(sum(passThreshold) >0 )
{
aucSplit <- split(getAUC(cells_AUC)[geneSetName,], passThreshold)
# Assign cell color
cellColor <- c(setNames(colorPal_Neg[cut(aucSplit[[1]], breaks=nBreaks)], names(aucSplit[[1]])),
setNames(colorPal_Pos[cut(aucSplit[[2]], breaks=nBreaks)], names(aucSplit[[2]])))
# Plot
plot(cellsTsne, main=geneSetName,
sub="Pink/red cells pass the threshold",
col=cellColor[rownames(cellsTsne)], pch=16)
}
}
查看閾值對細胞分配的影響
selectedThresholds[2] <- 0.25
par(mfrow=c(2,3))
AUCell_plotTSNE(tSNE=cellsTsne, exprMat=exprMatrix,
cellsAUC=cells_AUC[1:2,], thresholds=selectedThresholds)
6. 使用pbmc3k數(shù)據(jù)集進行實操
library(AUCell)
library(clusterProfiler)
library(ggplot2)
library(Seurat)
pbmc <- readRDS("pbmc.rds")
H <- read.gmt("h.all.v7.4.symbols.gmt.txt")
cells_rankings <- AUCell_buildRankings(pbmc@assays$RNA@data) # 關(guān)鍵一步
unique(H$term)
geneSets <- lapply(unique(H$term), function(x){H$gene[H$term == x]})
names(geneSets) <- unique(H$term)
cells_AUC <- AUCell_calcAUC(geneSets, cells_rankings, aucMaxRank=nrow(cells_rankings)*0.1)
length(rownames(cells_AUC@assays@data$AUC))
grep("INFLAMMATORY",rownames(cells_AUC@assays@data$AUC),value = T)
# [1] "HALLMARK_INFLAMMATORY_RESPONSE"
geneSet <- "HALLMARK_INFLAMMATORY_RESPONSE"
aucs <- as.numeric(getAUC(cells_AUC)[geneSet, ])
pbmc$AUC <- aucs
df<- data.frame(pbmc@meta.data, pbmc@reductions$umap@cell.embeddings)
head(df)
class_avg <- df %>%
group_by(cell_type) %>%
summarise(
UMAP_1 = median(UMAP_1),
UMAP_2 = median(UMAP_2)
)
ggplot(df, aes(UMAP_1, UMAP_2))+
geom_point(aes(colour= AUC)) +
viridis::scale_color_viridis(option="A") +
ggrepel::geom_label_repel(aes(label = cell_type),
data = class_avg,
size = 6,
label.size = 0,
segment.color = NA)+
theme(legend.position = "none") +
theme_bw()
參考:
http://127.0.0.1:28753/library/AUCell/doc/AUCell.html
http://www.reibang.com/p/2fb20f44da67