使用AUCell識(shí)別單細(xì)胞rna數(shù)據(jù)中具有活性“基因集”(i.e. gene signatures)的細(xì)胞。AUCell使用“曲線下面積”(Area Under the Curve,AUC)來(lái)計(jì)算輸入基因集的一個(gè)關(guān)鍵子集是否在每個(gè)細(xì)胞的表達(dá)基因中富集。AUC分?jǐn)?shù)在所有細(xì)胞的分布允許探索signatures的相對(duì)表達(dá)。
AUCell允許在單細(xì)胞rna數(shù)據(jù)中識(shí)別具有活性基因集(如簽名、基因模塊)的細(xì)胞湿硝。簡(jiǎn)言之,運(yùn)行AUCell的工作流基于三個(gè)步驟:
- Build the rankings
- Calculate the Area Under the Curve (AUC)
- Set the assignment thresholds
其實(shí)我們發(fā)現(xiàn)在SCENIC 包的分析過程中润努,已經(jīng)封裝了AUCell关斜。在單細(xì)胞數(shù)據(jù)的下游分析中往往聚焦于某個(gè)有意思的基因集(gene set),已經(jīng)發(fā)展出許多的富集方法铺浇。但是大部分的富集分析往往都是單向的:輸入基因集輸出通路(生物學(xué)意義)痢畜,但是很少有可以從基因集富集信息反饋到樣本上來(lái)的。AUCell在做這樣的嘗試。
應(yīng)用案例可以參考:
下面通過一個(gè)簡(jiǎn)短的示例來(lái)說(shuō)明它是如和運(yùn)作的丁稀。
library(AUCell)
library(clusterProfiler)
library(ggplot2)
library(Seurat)
library(SeuratData)
##seurat RDS object
cd8.seurat <- pbmc3k.final
cells_rankings <- AUCell_buildRankings(cd8.seurat@assays$RNA@data) # 關(guān)鍵一步
# ?AUCell_buildRankings
##load gene set, e.g. GSEA lists from BROAD
c5 <- read.gmt("c5.cc.v7.1.symbols.gmt") ## ALL GO tm
這個(gè)c5.cc.v7.1.symbols.gmt
文件可以在GSEA上面下載吼拥,要做下游分析通路是要會(huì)背的。
這里記錄的是每個(gè)通路及其對(duì)應(yīng)的基因集:
> head(c5$ont)
[1] GO_FILOPODIUM GO_FILOPODIUM GO_FILOPODIUM GO_FILOPODIUM GO_FILOPODIUM GO_FILOPODIUM
999 Levels: GO_FILOPODIUM GO_NUCLEAR_CHROMOSOME_TELOMERIC_REGION GO_CLATHRIN_SCULPTED_VESICLE GO_I_BAND ... GO_PROXIMAL_DENDRITE
> head(c5$gene)
[1] "ABI1" "ARL4C" "ABI2" "FARP1" "CDK5" "TUBB3"
geneSets <- lapply(unique(c5$ont), function(x){print(x);c5$gene[c5$ont == x]})
names(geneSets) <- unique(c5$ont)
此時(shí)线衫,我們可以根據(jù)GO通路找到對(duì)應(yīng)的基因:
geneSets$GO_I_BAND
[1] "DNAJB6" "MYZAP" "STUB1" "MYL12B" "MYL9" "NEBL" "GLRX3" "PDLIM5" "CFL2" "FERMT2" "LDB3"
[12] "AHNAK2" "SYNPO" "FBXO32" "C10orf71" "MYOM3" "XIRP2" "KLHL40" "CRYAB" "CSRP1" "ADRA1A" "CTNNB1"
[23] "DES" "SYNPO2" "DMD" "SMTNL1" "ALDOA" "FHL2" "FHL3" "FKBP1A" "FKBP1B" "PALLD" "FLNA"
[34] "FLNB" "FLNC" "SYNE2" "OBSL1" "FRG1" "FBXO22" "ANKRD2" "ITGB1BP2" "ANKRD1" "PDLIM3" "BMP10"
[45] "BIN1" "FBXL22" "ANK1" "ANK2" "ANK3" "PARVB" "PRICKLE4" "HRC" "HSPB1" "KY" "CAVIN4"
[56] "JUP" "KCNA5" "KCNE1" "KCNN2" "KRT8" "KRT19" "MTM1" "MYH6" "MYH7" "MYL3" "PPP1R12A"
[67] "PPP1R12B" "NEB" "NOS1" "NRAP" "ATP2B4" "PAK1" "PDE4B" "MYOZ2" "PGM5" "PPP2R5A" "PPP3CA"
[78] "PPP3CB" "PARVA" "SCN3B" "PSEN1" "PSEN2" "JPH1" "JPH2" "TRIM54" "MYOZ1" "RYR1" "RYR2"
[89] "RYR3" "S100A1" "SCN1A" "SCN5A" "SCN8A" "PDLIM2" "SLC2A1" "SLC4A1" "SLC8A1" "SMN1" "SMN2"
[100] "DST" "SRI" "ACTC1" "TTN" "CACNA1C" "CACNA1D" "CACNA1S" "SYNPO2L" "FHOD3" "CSRP3" "ACTN4"
[111] "SYNC" "CAPN3" "OBSCN" "CASQ1" "CASQ2" "MYPN" "TRIM63" "SORBS2" "MYO18B" "TCAP" "PDLIM4"
[122] "CAV3" "ACTN1" "MYOM1" "FBP2" "ACTN2" "KAT2B" "AKAP4" "IGFN1" "PDLIM1" "NEXN" "MYOM2"
[133] "MYOZ3" "PDLIM7" "HOMER1" "FHL5" "MYOT" "BAG3" "NOS1AP" "HDAC4"
計(jì)算AUC:
?AUCell_calcAUC
cells_AUC <- AUCell_calcAUC(geneSets, cells_rankings, aucMaxRank=nrow(cells_rankings)*0.1)
找一些通路(該找哪些通路呢凿可?)
> length(rownames(cells_AUC@assays@data$AUC))
[1] 958
> grep("REG",rownames(cells_AUC@assays@data$AUC),value = T)
[1] "GO_NUCLEAR_CHROMOSOME_TELOMERIC_REGION" "GO_CHROMOSOME_CENTROMERIC_REGION"
[3] "GO_CYTOPLASMIC_REGION" "GO_PROTEASOME_REGULATORY_PARTICLE_BASE_SUBCOMPLEX"
[5] "GO_PERINUCLEAR_REGION_OF_CYTOPLASM" "GO_CHROMOSOMAL_REGION"
[7] "GO_CELL_CORTEX_REGION" "GO_PARANODE_REGION_OF_AXON"
[9] "GO_CONDENSED_CHROMOSOME_CENTROMERIC_REGION" "GO_CONDENSED_NUCLEAR_CHROMOSOME_CENTROMERIC_REGION"
[11] "GO_PLASMA_MEMBRANE_REGION" "GO_CHROMOSOME_TELOMERIC_REGION"
[13] "GO_MEMBRANE_REGION" "GO_PROTEASOME_REGULATORY_PARTICLE_LID_SUBCOMPLEX"
[15] "GO_MYELIN_SHEATH_ABAXONAL_REGION" "GO_MYELIN_SHEATH_ADAXONAL_REGION"
[17] "GO_JUXTAPARANODE_REGION_OF_AXON" "GO_REGION_OF_CYTOSOL"
[19] "GO_CENTRAL_REGION_OF_GROWTH_CONE"
我們選擇GO_PERINUCLEAR_REGION_OF_CYTOPLASM
。
> ##set gene set of interest here for plotting
> geneSet <- "GO_PERINUCLEAR_REGION_OF_CYTOPLASM"
> aucs <- as.numeric(getAUC(cells_AUC)[geneSet, ])
> cd8.seurat$AUC <- aucs
> df<- data.frame(cd8.seurat@meta.data, cd8.seurat@reductions$umap@cell.embeddings)
> head(df)
orig.ident nCount_RNA nFeature_RNA seurat_annotations percent.mt RNA_snn_res.0.5 seurat_clusters AUC UMAP_1
AAACATACAACCAC pbmc3k 2419 779 Memory CD4 T 3.0177759 1 1 0.06102966 -4.232792
AAACATTGAGCTAC pbmc3k 4903 1352 B 3.7935958 3 3 0.08560310 -4.892886
AAACATTGATCAGC pbmc3k 3147 1129 Memory CD4 T 0.8897363 1 1 0.08328099 -5.508639
AAACCGTGCTTCCG pbmc3k 2639 960 CD14+ Mono 1.7430845 2 2 0.07669723 11.332233
AAACCGTGTATGCG pbmc3k 980 521 NK 1.2244898 6 6 0.06250478 -7.450703
AAACGCACTGGTAC pbmc3k 2163 781 Memory CD4 T 1.6643551 1 1 0.08204075 -3.509504
UMAP_2
AAACATACAACCAC -4.152139
AAACATTGAGCTAC 10.985685
AAACATTGATCAGC -7.211088
AAACCGTGCTTCCG 3.161727
AAACCGTGTATGCG 1.092022
AAACGCACTGGTAC -6.087042
我們看到每個(gè)細(xì)胞現(xiàn)在都加上AUC值了授账,下面做一下可視化枯跑。
class_avg <- df %>%
group_by( seurat_annotations) %>%
summarise(
UMAP_1 = median(UMAP_1),
UMAP_2 = median(UMAP_2)
)
ggplot(df, aes(UMAP_1, UMAP_2)) +
geom_point(aes(colour = AUC)) + viridis::scale_color_viridis(option="A") +
ggrepel::geom_label_repel(aes(label = seurat_annotations),
data = class_avg,
size = 6,
label.size = 0,
segment.color = NA
)+ theme(legend.position = "none") + theme_bw()
關(guān)鍵是,你要找到基因集啊白热。
https://bioconductor.org/packages/release/bioc/html/AUCell.html