前情回顧
Seurat 4.0 ||單細(xì)胞數(shù)據(jù)分析工具箱有更新
Seurat 4.0 ||單細(xì)胞多模態(tài)數(shù)據(jù)整合算法WNN
Seurat 4.0 || 分析scRNA和表面抗體數(shù)據(jù)
單細(xì)胞數(shù)據(jù)數(shù)據(jù)正在走向統(tǒng)一:在一個細(xì)胞內(nèi)同時測量分屬中心法則不同階段的多種數(shù)據(jù)類型昔瞧。這需要我們開發(fā)新的方法來將這些數(shù)據(jù)整合在一起以描繪一個完整的細(xì)胞狀態(tài)典奉。Seurat升級到4.0以后驹吮,在一個Seurat對象中可以存儲(數(shù)據(jù)結(jié)構(gòu))和計算(算法)單細(xì)胞多模態(tài)數(shù)據(jù)虱饿。本文我們跟著官方教程演示使用WNN分析多模態(tài)技術(shù)分析10xscRNA+ATAC數(shù)據(jù)。使用的數(shù)據(jù)集在10x網(wǎng)站上公開标捺,是為10412個同時測量轉(zhuǎn)錄組和ATAC的PBMCs細(xì)胞套才。
在這個例子中漠趁,我們將演示:
- 使用成對的轉(zhuǎn)錄組和ATAC-seq數(shù)據(jù)創(chuàng)建多模式Seurat對象
- 對單細(xì)胞RNA+ATAC數(shù)據(jù)進行加權(quán)近鄰聚類
- 利用這兩種模式來識別不同細(xì)胞類型和狀態(tài)的調(diào)節(jié)因子
你可以從10x dataset網(wǎng)站下載數(shù)據(jù)集。請務(wù)必下載以下文件:
Filtered feature barcode matrix (HDF5)
ATAC Per fragment information file (TSV.GZ)
ATAC Per fragment information index (TSV.GZ index) # 這個索引文件別忘了哦
網(wǎng)址:https://support.10xgenomics.com/single-cell-multiome-atac-gex/datasets/1.0.0/pbmc_granulocyte_sorted_10k?
最后盾沫,為了運行本示例裁赠,請確保安裝了以下軟件包:
- Seurat v4.
- Signac for the analysis of single-cell chromatin datasets
- EnsDb.Hsapiens.v86 for a set of annotations for hg38
- dplyr to help manipulate data tables
library(Seurat)
library(Signac)
library(EnsDb.Hsapiens.v86)
library(dplyr)
library(ggplot2)
我們將基于基因表達(dá)數(shù)據(jù)創(chuàng)建一個Seurat對象,然后添加ATAC-seq數(shù)據(jù)作為第二個assay赴精。關(guān)于創(chuàng)建和處理ATAC對象的更多信息佩捞,可以瀏覽Signac相關(guān)文檔。
# the 10x hdf5 file contains both data types.
inputdata.10x <- Read10X_h5("../data/pbmc_granulocyte_sorted_10k_filtered_feature_bc_matrix.h5")
# extract RNA and ATAC data
rna_counts <- inputdata.10x$`Gene Expression`
atac_counts <- inputdata.10x$Peaks
# Create Seurat object
pbmc <- CreateSeuratObject(counts = rna_counts)
pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-")
head(pbmc@meta.data)
orig.ident nCount_RNA nFeature_RNA percent.mt
AAACAGCCAAGGAATC-1 SeuratProject 8380 3308 7.470167
AAACAGCCAATCCCTT-1 SeuratProject 3771 1896 10.527711
AAACAGCCAATGCGCT-1 SeuratProject 6876 2904 6.457243
AAACAGCCACACTAAT-1 SeuratProject 1733 846 18.003462
AAACAGCCACCAACCG-1 SeuratProject 5415 2282 6.500462
AAACAGCCAGGATAAC-1 SeuratProject 2759 1353 6.922798
VlnPlot(pbmc,features=c('percent.mt','nCount_RNA', 'nFeature_RNA'),pt.size = 0,cols = "red")
添加ATAC數(shù)據(jù)蕾哟。
# Now add in the ATAC-seq data
# we'll only use peaks in standard chromosomes
grange.counts <- StringToGRanges(rownames(atac_counts), sep = c(":", "-"))
grange.use <- seqnames(grange.counts) %in% standardChromosomes(grange.counts)
atac_counts <- atac_counts[as.vector(grange.use), ]
annotations <- GetGRangesFromEnsDb(ensdb = EnsDb.Hsapiens.v86)
seqlevelsStyle(annotations) <- 'UCSC'
genome(annotations) <- "hg38"
frag.file <- "/home/data/vip06/dataset/pbmc_granulocyte_sorted_10k_atac_fragments.tsv.gz"
chrom_assay <- CreateChromatinAssay(
counts = atac_counts,
sep = c(":", "-"),
genome = 'hg38',
fragments = frag.file,
min.cells = 10,
annotation = annotations
)
pbmc[["ATAC"]] <- chrom_assay
查看數(shù)據(jù)集對象:
pbmc[["ATAC"]]
ChromatinAssay data with 105913 features for 11909 cells
Variable features: 0
Genome: hg38
Annotation present: TRUE
Motifs present: FALSE
Fragment files: 1
pbmc
An object of class Seurat
142514 features across 11909 samples within 2 assays
Active assay: RNA (36601 features, 0 variable features)
1 other assay present: ATAC
我們根據(jù)每種模式檢測到的分子數(shù)以及線粒體百分比來執(zhí)行基本的QC一忱,其實這一步什么也沒有Q掉,只是獲得數(shù)據(jù)概覽谭确,了解數(shù)據(jù)的質(zhì)量分布帘营。
head(pbmc@meta.data)
orig.ident nCount_RNA nFeature_RNA percent.mt nCount_ATAC nFeature_ATAC
AAACAGCCAAGGAATC-1 SeuratProject 8380 3308 7.470167 55550 13867
AAACAGCCAATCCCTT-1 SeuratProject 3771 1896 10.527711 20485 7247
AAACAGCCAATGCGCT-1 SeuratProject 6876 2904 6.457243 16674 6528
AAACAGCCACACTAAT-1 SeuratProject 1733 846 18.003462 2007 906
AAACAGCCACCAACCG-1 SeuratProject 5415 2282 6.500462 7658 3323
AAACAGCCAGGATAAC-1 SeuratProject 2759 1353 6.922798 10355 4267
VlnPlot(pbmc, features = c("nCount_RNA", "nFeature_RNA", "percent.mt", "nCount_ATAC", "nFeature_ATAC"), ncol = 3,
log = TRUE, pt.size = 0) + NoLegend()
如果是多個樣本這樣的展示方式可以看到每個樣本的質(zhì)量差異,咦逐哈,WNN框架下如何執(zhí)行多樣本分析仪吧?這里我們還是假設(shè)有那么一組吧,人為設(shè)置一個分組鞠眉。
pbmc$replicate <- sample(c("ctrl", "para"), size = ncol(pbmc), replace = TRUE)
VlnPlot(pbmc, features = c("nCount_RNA", "nFeature_RNA", "percent.mt", "nCount_ATAC", "nFeature_ATAC"), ncol = 3,group.by = 'replicate',
log = TRUE, pt.size = 0) + NoLegend()
看到數(shù)據(jù)的質(zhì)量分布之后薯鼠,我們可以執(zhí)行真正的質(zhì)控了择诈,把異常值或者不是細(xì)胞的barcode去掉,保證進入分析的都是細(xì)胞出皇,而非雜質(zhì)羞芍。
pbmc <- subset(
x = pbmc,
subset = nCount_ATAC < 7e4 &
nCount_ATAC > 5e3 &
nCount_RNA < 25000 &
nCount_RNA > 1000 &
percent.mt < 20
)
接下來,我們使用RNA和ATAC-seq數(shù)據(jù)的標(biāo)準(zhǔn)方法郊艘,分別對這兩種檢測方法進行預(yù)處理和降維荷科。
# RNA analysis
DefaultAssay(pbmc) <- "RNA"
pbmc <- SCTransform(pbmc, verbose = FALSE) %>% RunPCA() %>% RunUMAP(dims = 1:50, reduction.name = 'umap.rna', reduction.key = 'rnaUMAP_')
PC_ 1
Positive: VCAN, PLXDC2, SAT1, SLC8A1, NAMPT, NEAT1, ZEB2, DPYD, LRMDA, LYZ
FCN1, LYN, ANXA1, JAK2, CD36, AOAH, ARHGAP26, TYMP, RBM47, PLCB1
PID1, LRRK2, ACSL1, LYST, IRAK3, DMXL2, MCTP1, AC020916.1, GAB2, TBXAS1
Negative: EEF1A1, RPL13, BCL2, RPS27, RPL13A, LEF1, BCL11B, BACH2, INPP4B, RPL41
IL32, RPS27A, CAMK4, IL7R, RPL3, SKAP1, RPS2, GNLY, LTB, CD247
RPS12, RPS18, RPL11, PRKCH, ANK3, RPL23A, THEMIS, CCL5, RPL10, RPS3
PC_ 2
Positive: BANK1, IGHM, AFF3, IGKC, RALGPS2, MS4A1, PAX5, CD74, EBF1, FCRL1
CD79A, OSBPL10, LINC00926, COL19A1, BLK, NIBAN3, IGHD, CD22, HLA-DRA, ADAM28
CD79B, PLEKHG1, COBLL1, AP002075.1, LIX1-AS1, CCSER1, TCF4, BCL11A, GNG7, STEAP1B
Negative: GNLY, CCL5, NKG7, CD247, IL32, PRKCH, PRF1, BCL11B, INPP4B, LEF1
IL7R, CAMK4, THEMIS, GZMA, RORA, GZMH, TXK, TC2N, TRBC1, PITPNC1
SYNE2, PDE3B, STAT4, TRAC, TGFBR3, VCAN, KLRD1, NCALD, CTSW, SLFN12L
PC_ 3
Positive: LEF1, CAMK4, INPP4B, PDE3B, IL7R, FHIT, BACH2, MAML2, TSHZ2, SERINC5
BCL2, EEF1A1, NELL2, ANK3, CCR7, PRKCA, TCF7, BCL11B, LTB, AC139720.1
OXNAD1, MLLT3, RPL11, AL589693.1, TRABD2A, RPL13, RASGRF2, MAL, NR3C2, LDHB
Negative: GNLY, NKG7, CCL5, PRF1, GZMA, GZMH, KLRD1, SPON2, FGFBP2, CST7
GZMB, CCL4, TGFBR3, KLRF1, CTSW, BNC2, ADGRG1, PDGFD, PPP2R2B, PYHIN1
IL2RB, CLIC3, A2M, NCAM1, MCTP2, TRDC, C1orf21, SAMD3, MYBL1, FCRL6
PC_ 4
Positive: TCF4, CUX2, AC023590.1, LINC01374, RHEX, LINC01478, EPHB1, PLD4, PTPRS, ZFAT
LINC00996, CLEC4C, LILRA4, COL26A1, PLXNA4, SCN9A, CCDC50, FAM160A1, ITM2C, COL24A1
UGCG, RUNX2, GZMB, NRP1, PHEX, IRF8, JCHAIN, SLC35F3, PACSIN1, AC007381.1
Negative: BANK1, IGHM, MS4A1, PAX5, IGKC, FCRL1, EBF1, RALGPS2, CD79A, LINC00926
OSBPL10, GNLY, COL19A1, IGHD, NKG7, CD22, BACH2, CCL5, PLEKHG1, CD79B
ADAM28, LIX1-AS1, LARGE1, STEAP1B, AP002075.1, AC120193.1, BLK, FCER2, ARHGAP24, IFNG-AS1
PC_ 5
Positive: VCAN, PLCB1, ANXA1, PLXDC2, DPYD, CD36, ARHGAP26, ARHGAP24, ACSL1, AC020916.1
LRMDA, DYSF, CSF3R, MEGF9, PDE4D, S100A8, FNDC3B, ADAMTSL4-AS1, S100A9, TMTC2
RAB11FIP1, CREB5, LRRK2, NAMPT, RBM47, JUN, SLC2A3, GNLY, GLT1D1, VCAN-AS1
Negative: CDKN1C, FCGR3A, TCF7L2, IFITM3, CST3, MTSS1, AIF1, LST1, MS4A7, PSAP
WARS, ACTB, HLA-DPA1, SERPINA1, CD74, FCER1G, IFI30, COTL1, CFD, SMIM25
HLA-DRA, HLA-DRB1, FMNL2, CSF1R, FTL, MAFB, TMSB4X, TYROBP, HLA-DPB1, CCDC26
06:10:45 UMAP embedding parameters a = 0.9922 b = 1.112
06:10:45 Read 10412 rows and found 50 numeric columns
06:10:45 Using Annoy for neighbor search, n_neighbors = 30
06:10:45 Building Annoy index with metric = cosine, n_trees = 50
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
06:10:46 Writing NN index file to temp file /tmp/Rtmp3eAPNF/file3dd9069c738d7
06:10:46 Searching Annoy index using 1 thread, search_k = 3000
06:10:49 Annoy recall = 100%
06:10:50 Commencing smooth kNN distance calibration using 1 thread
06:10:52 Initializing from normalized Laplacian + noise
06:10:53 Commencing optimization for 200 epochs, with 443480 positive edges
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
06:10:59 Optimization finished
There were 50 or more warnings (use warnings() to see the first 50)
分析ATAC數(shù)據(jù),ATAC數(shù)據(jù)在這里也是一個矩陣纱注,只是不是RNA含量而是fragments矩陣畏浆。但是ATAC的assays更豐富。
str(pbmc@assays$ATAC,max.level = 2)
Formal class 'ChromatinAssay' [package "Signac"] with 16 slots
..@ ranges :Formal class 'GRanges' [package "GenomicRanges"] with 7 slots
..@ motifs : NULL
..@ fragments :List of 1
..@ seqinfo :Formal class 'Seqinfo' [package "GenomeInfoDb"] with 4 slots
..@ annotation :Formal class 'GRanges' [package "GenomicRanges"] with 7 slots
..@ bias : NULL
..@ positionEnrichment: list()
..@ links :Formal class 'GRanges' [package "GenomicRanges"] with 7 slots
..@ counts :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
..@ data :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
..@ scale.data : num[0 , 0 ]
..@ key : chr "atac_"
..@ assay.orig : NULL
..@ var.features : chr [1:105913] "chr19-1230013-1281741" "chr19-39381822-39438328" "chr17-7832492-7859042" "chr19-12774552-12797715" ...
..@ meta.features :'data.frame': 105913 obs. of 2 variables:
..@ misc : NULL
如fragments矩陣信息
head(pbmc@assays$ATAC@counts)
6 x 10412 sparse Matrix of class "dgCMatrix"
[[ suppressing 76 column names 'AAACAGCCAAGGAATC-1', 'AAACAGCCAATCCCTT-1', 'AAACAGCCAATGCGCT-1' ... ]]
chr1-10109-10357 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
chr1-180730-181630 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
chr1-191491-191736 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
chr1-267816-268196 . . . . . . . . . . . . . . . 2 . . . . . . . . . . 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
chr1-586028-586373 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
chr1-629721-630172 . . . . . . . . . . . . . 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
chr1-10109-10357 . . . ......
chr1-180730-181630 . . . ......
chr1-191491-191736 . . . ......
chr1-267816-268196 . . . ......
chr1-586028-586373 . . . ......
chr1-629721-630172 . . . ......
.....suppressing 10336 columns in show(); maybe adjust 'options(max.print= *, width = *)'
..............................
如注釋信息:
head(pbmc@assays$ATAC@annotation)
GRanges object with 6 ranges and 5 metadata columns:
seqnames ranges strand | tx_id gene_name gene_id gene_biotype type
<Rle> <IRanges> <Rle> | <character> <character> <character> <character> <factor>
ENSE00001489430 chrX 276322-276394 + | ENST00000399012 PLCXD1 ENSG00000182378 protein_coding exon
ENSE00001536003 chrX 276324-276394 + | ENST00000484611 PLCXD1 ENSG00000182378 protein_coding exon
ENSE00002160563 chrX 276353-276394 + | ENST00000430923 PLCXD1 ENSG00000182378 protein_coding exon
ENSE00001750899 chrX 281055-281121 + | ENST00000445062 PLCXD1 ENSG00000182378 protein_coding exon
ENSE00001489388 chrX 281192-281684 + | ENST00000381657 PLCXD1 ENSG00000182378 protein_coding exon
ENSE00001719251 chrX 281194-281256 + | ENST00000429181 PLCXD1 ENSG00000182378 protein_coding exon
-------
seqinfo: 25 sequences from hg38 genome
理解Seurat的數(shù)據(jù)是很重要的狞贱,不然很多分析數(shù)據(jù)可能不知道如何提取調(diào)用刻获,也不知道函數(shù)做了什么。在這之后瞎嬉,我們進行ATAC數(shù)據(jù)的標(biāo)準(zhǔn)分析蝎毡。
# ATAC analysis
# We exclude the first dimension as this is typically correlated with sequencing depth
DefaultAssay(pbmc) <- "ATAC"
pbmc <- RunTFIDF(pbmc)
pbmc <- FindTopFeatures(pbmc, min.cutoff = 'q0')
pbmc <- RunSVD(pbmc)
pbmc <- RunUMAP(pbmc, reduction = 'lsi', dims = 2:50, reduction.name = "umap.atac", reduction.key = "atacUMAP_")
注意,這里的以及所有特征選擇之后的再次降維用到的維度dims
,都需要謹(jǐn)慎選擇氧枣,我們知道每個坐標(biāo)值都有一系列特征支持程度(如loading值),可以反映出該維度代表的生物學(xué)意義沐兵。如這里的dims = 2:50
,排除了第一個維度便监,因為作者發(fā)現(xiàn)它通常與排序深度相關(guān)
pbmc@reductions$lsi@feature.loadings[1:4,1:4]
LSI_1 LSI_2 LSI_3 LSI_4
chr19-1230013-1281741 0.008759956 0.0022949972 -0.0008663089 -0.0019122743
chr19-39381822-39438328 0.008735519 -0.0007909501 -0.0018568479 -0.0018096328
chr17-7832492-7859042 0.008713235 0.0012451020 -0.0006112821 0.0004232284
chr19-12774552-12797715 0.008700547 0.0015833131 -0.0014307690 0.0013240653
用Signac分析過scATAC的同學(xué)對這里的流程不會感到陌生扎谎,如果是第一次接觸ATAC分析會覺得比較陌生了,如RunTFIDF
是什么?這時候我們可以?RunTFIDF
來查看具體信息以及參考文獻烧董。
?RunTFIDF
method
Which TF-IDF implementation to use. Choice of:
* 1: The TF-IDF implementation used by Stuart & Butler et al. 2019 (https://doi.org/10.1101/460147). This computes *\log(TF \times IDF)*.
* 2: The TF-IDF implementation used by Cusanovich & Hill et al. 2018 (https://doi.org/10.1016/j.cell.2018.06.052). This computes *TF \times (\log(IDF))*
* 3: The log-TF method used by Andrew Hill (http://andrewjohnhill.com/blog/2019/05/06/dimensionality-reduction-for-scatac-data/). This computes *\log(TF) \times \log(IDF)*.
* 4: The 10x Genomics method (no TF normalization). This computes *IDF*.
我們計算一個WNN圖簿透,表示RNA和ATAC-seq模式的加權(quán)組合。我們將此圖用于UMAP的可視化和聚類解藻。我們在Seurat 4.0 || 分析scRNA和表面抗體數(shù)據(jù)
中查看過FindMultiModalNeighbors
之后的數(shù)據(jù)結(jié)構(gòu)老充。
pbmc <- FindMultiModalNeighbors(pbmc, reduction.list = list("pca", "lsi"), dims.list = list(1:50, 2:50))
Calculating cell-specific modality weights
Finding 20 nearest neighbors for each modality.
|++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=05s
Calculating kernel bandwidths
|++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=01s
Finding multimodal neighbors
|++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=20s
|++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=01s
Constructing multimodal KNN graph
Constructing multimodal SNN graph
pbmc <- RunUMAP(pbmc, nn.name = "weighted.nn", reduction.name = "wnn.umap", reduction.key = "wnnUMAP_")
06:41:32 UMAP embedding parameters a = 0.9922 b = 1.112
06:41:33 Commencing smooth kNN distance calibration using 1 thread
06:41:35 Initializing from normalized Laplacian + noise
06:41:36 Commencing optimization for 200 epochs, with 308698 positive edges
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
06:41:42 Optimization finished
pbmc <- FindClusters(pbmc, graph.name = "wsnn", algorithm = 3, verbose = FALSE)
這里聚類用的算法是SLM
algorithm
Algorithm for modularity optimization
(1 = original Louvain algorithm;
2 = Louvain algorithm with multilevel refinement;
3 = SLM algorithm;
4 = Leiden algorithm). Leiden requires the leidenalg python.
這些事圖聚類的話語體系,我們知道在構(gòu)建了細(xì)胞間的圖結(jié)構(gòu)之后螟左,要聚類就需要計算細(xì)胞間的相互關(guān)系啡浊,這里是網(wǎng)略數(shù)據(jù)科學(xué)的一個重要領(lǐng)域:社區(qū)發(fā)現(xiàn)(community detection),如這里的SLM (smart local moving)算法是一種在大型網(wǎng)絡(luò)中進行社區(qū)發(fā)現(xiàn)(或聚類)的算法胶背。SLM算法最大化了所謂的模塊化函數(shù)巷嚣。該算法已成功應(yīng)用于具有數(shù)千萬個節(jié)點和數(shù)億條邊的網(wǎng)絡(luò)。算法的細(xì)節(jié)記錄在一篇論文中(Waltman & Van Eck, 2013)钳吟。在igraph中實現(xiàn)SLM可以這樣:
#devtools::install_github("chen198328/slm")
library(igraph)
library(slm)
net <- graph_from_adjacency_matrix(pbmc[["wsnn"]])
slm.net <-slm.community(net)
當(dāng)然這對大部分不做算法調(diào)整的同學(xué)來講并沒有什么用廷粒,這里只是說明我們的聚類是在圖空間中進行的。
接下來我們對細(xì)胞群進行注釋。注意坝茎,您還可以使用我們構(gòu)建好的PBMC參考數(shù)據(jù)集來進行注釋涤姊,可以用Seurat(vignette見官網(wǎng))或自動化web工具:Azimuth。
# perform sub-clustering on cluster 6 to find additional structure
pbmc <- FindSubCluster(pbmc, cluster = 6, graph.name = "wsnn", algorithm = 3)
Idents(pbmc) <- "sub.cluster"
指定某群在聚類這個函數(shù)好用啊
table(pbmc$seurat_clusters)
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
2386 1159 973 606 558 502 498 470 438 438 372 352 329 285 256 198 163 142 137 106 26 18
> table(pbmc$sub.cluster)
0 1 10 11 12 13 14 15 16 17 18 19 2 20 21 3 4 5 6_0 6_1 7 8 9
2386 1159 372 352 329 285 256 198 163 142 137 106 973 26 18 606 558 502 274 224 470 438 438
其他沒變嗤放,cluster被分成了兩組:6_0 ,6_1
思喊。
# add annotations
pbmc <- RenameIdents(pbmc, '19' = 'pDC','20' = 'HSPC','15' = 'cDC')
pbmc <- RenameIdents(pbmc, '0' = 'CD14 Mono', '8' ='CD14 Mono', '5' = 'CD16 Mono')
pbmc <- RenameIdents(pbmc, '17' = 'Naive B', '11' = 'Intermediate B', '10' = 'Memory B', '21' = 'Plasma')
pbmc <- RenameIdents(pbmc, '7' = 'NK')
pbmc <- RenameIdents(pbmc, '4' = 'CD4 TCM', '13'= "CD4 TEM", '3' = "CD4 TCM", '16' ="Treg", '1' ="CD4 Naive", '14' = "CD4 Naive")
pbmc <- RenameIdents(pbmc, '2' = 'CD8 Naive', '9'= "CD8 Naive", '12' = 'CD8 TEM_1', '6_0' = 'CD8 TEM_2', '6_1' ='CD8 TEM_2')
pbmc <- RenameIdents(pbmc, '18' = 'MAIT')
pbmc <- RenameIdents(pbmc, '6_2' ='gdT', '6_3' = 'gdT')
pbmc$celltype <- Idents(pbmc)
pbmc
An object of class Seurat
166710 features across 10412 samples within 3 assays
Active assay: ATAC (105913 features, 105913 variable features)
2 other assays present: RNA, SCT
5 dimensional reductions calculated: pca, umap.rna, lsi, umap.atac, wnn.umap
head(pbmc@meta.data)
orig.ident nCount_RNA nFeature_RNA percent.mt nCount_ATAC nFeature_ATAC replicate nCount_SCT nFeature_SCT SCT.weight wsnn_res.0.8
AAACAGCCAAGGAATC-1 SeuratProject 8380 3308 7.470167 55550 13867 para 4685 2693 0.4514450 1
AAACAGCCAATCCCTT-1 SeuratProject 3771 1896 10.527711 20485 7247 para 3781 1895 0.5069799 4
AAACAGCCAATGCGCT-1 SeuratProject 6876 2904 6.457243 16674 6528 para 4686 2769 0.4324867 1
AAACAGCCACCAACCG-1 SeuratProject 5415 2282 6.500462 7658 3323 ctrl 4486 2278 0.4738122 2
AAACAGCCAGGATAAC-1 SeuratProject 2759 1353 6.922798 10355 4267 ctrl 3332 1352 0.5361155 1
AAACAGCCAGTAGGTG-1 SeuratProject 7614 3061 6.895193 39441 11628 ctrl 4711 2764 0.5273038 9
seurat_clusters sub.cluster celltype
AAACAGCCAAGGAATC-1 1 1 CD4 Naive
AAACAGCCAATCCCTT-1 4 4 CD4 TCM
AAACAGCCAATGCGCT-1 1 1 CD4 Naive
AAACAGCCACCAACCG-1 2 2 CD8 Naive
AAACAGCCAGGATAAC-1 1 1 CD4 Naive
AAACAGCCAGTAGGTG-1 9 9 CD8 Naive
基于基因表達(dá)可視化聚類,ATAC-seq次酌,或WNN分析恨课。與前面的分析相比,差異更加微妙(您可以探索權(quán)重岳服,它比我們的CITE-seq示例中更均勻地分割)剂公,但是我們發(fā)現(xiàn)WNN提供了最清晰的細(xì)胞狀態(tài)分離。
p1 <- DimPlot(pbmc, reduction = "umap.rna", group.by = "celltype", label = TRUE, label.size = 2.5, repel = TRUE) + ggtitle("RNA")
p2 <- DimPlot(pbmc, reduction = "umap.atac", group.by = "celltype", label = TRUE, label.size = 2.5, repel = TRUE) + ggtitle("ATAC")
p3 <- DimPlot(pbmc, reduction = "wnn.umap", group.by = "celltype", label = TRUE, label.size = 2.5, repel = TRUE) + ggtitle("WNN")
p1 + p2 + p3 & NoLegend() & theme(plot.title = element_text(hjust = 0.5))
例如吊宋,ATAC-seq數(shù)據(jù)有助于CD4和CD8 T細(xì)胞狀態(tài)的分離纲辽,這是由于多個基因座的存在。在不同的T細(xì)胞亞型之間表現(xiàn)出不同的可及性贫母。例如文兑,我們可以利用Signac可視化工具盒刚,將CD8A位點的“偽體(pseudobulk)”軌跡與基因表達(dá)水平的小提琴圖一起可視化腺劣。
## to make the visualization easier, subset T cell clusters
celltype.names <- levels(pbmc)
tcell.names <- grep("CD4|CD8|Treg", celltype.names,value = TRUE)
tcells <- subset(pbmc, idents = tcell.names)
CoveragePlot(tcells, region = 'CD8A', features = 'CD8A', assay = 'ATAC', expression.assay = 'SCT', peaks = FALSE)
接下來,我們將檢查每個細(xì)胞的可達(dá)區(qū)域因块,以確定富集基序(enriched motifs)橘原。正如Signac motifs vignette中所描述的,有幾種方法可以做到這一點涡上,但我們將使用Greenleaf實驗室的chromVAR包趾断。這將計算已知motifs的每個單元的可訪問性得分,并將這些得分作為Seurat對象中的第三個分析(chromVAR)添加吩愧。
再走下去芋酌,請確保安裝了以下軟件包。
- chromVAR for the analysis of motif accessibility in scATAC-seq
- presto for fast differential expression analyses.
- TFBSTools for TFBS analysis
- JASPAR2020 for JASPAR motif models
- motifmatchr for motif matching
- BSgenome.Hsapiens.UCSC.hg38 for chromVAR
我先library一下看看哪些是安裝過的雁佳。
library(chromVAR)
library(JASPAR2020)
library(TFBSTools)
library(motifmatchr)
library(BSgenome.Hsapiens.UCSC.hg38)
針對沒安裝的脐帝,我們可以這樣來安裝。
remotes::install_github("immunogenomics/presto")
BiocManager::install(c("chromVAR", "TFBSTools", "JASPAR2020", "motifmatchr", "BSgenome.Hsapiens.UCSC.hg38"))
# Scan the DNA sequence of each peak for the presence of each motif, and create a Motif object
DefaultAssay(pbmc) <- "ATAC"
pwm_set <- getMatrixSet(x = JASPAR2020, opts = list(species = 9606, all_versions = FALSE))
motif.matrix <- CreateMotifMatrix(features = granges(pbmc), pwm = pwm_set, genome = 'hg38', use.counts = FALSE)
motif.object <- CreateMotifObject(data = motif.matrix, pwm = pwm_set)
pbmc <- SetAssayData(pbmc, assay = 'ATAC', slot = 'motifs', new.data = motif.object)
# Note that this step can take 30-60 minutes
pbmc <- RunChromVAR(
object = pbmc,
genome = BSgenome.Hsapiens.UCSC.hg38
)
pbmc
An object of class Seurat
166710 features across 10412 samples within 3 assays
Active assay: ATAC (105913 features, 105913 variable features)
2 other assays present: RNA, SCT
5 dimensional reductions calculated: pca, umap.rna, lsi, umap.atac, wnn.umap
最后糖权,我們探索多模態(tài)數(shù)據(jù)集來識別每個細(xì)胞狀態(tài)的關(guān)鍵調(diào)節(jié)器(key regulators)堵腹。配對數(shù)據(jù)提供了一個獨特的機會來識別滿足多種標(biāo)準(zhǔn)的轉(zhuǎn)錄因子(TFs),幫助縮小假定的調(diào)控因子的名單列表星澳。我們的目標(biāo)是在RNA測量中識別在多種細(xì)胞類型中表達(dá)豐富的TFs疚顷,同時在ATAC測量中富集其motifs 的可達(dá)性。
作為陽性對照,CCAAT增強子結(jié)合蛋白(CEBP)家族蛋白腿堤,包括TF CEBPB阀坏,已多次證明在單核細(xì)胞和樹突狀細(xì)胞等髓系細(xì)胞的分化和功能中發(fā)揮重要作用。我們可以看到CEBPB的表達(dá)和編碼CEBPB結(jié)合位點的MA0466.2.4基序的可及性都在單核細(xì)胞中富集释液。
#returns MA0466.2
motif.name <- ConvertMotifID(pbmc, name = 'CEBPB')
gene_plot <- FeaturePlot(pbmc, features = "sct_CEBPB", reduction = 'wnn.umap')
motif_plot <- FeaturePlot(pbmc, features = motif.name, min.cutoff = 0, cols = c("lightgrey", "darkred"), reduction = 'wnn.umap')
gene_plot | motif_plot
我們想要量化這個關(guān)系全释,并搜索所有的細(xì)胞類型來找到類似的例子。為此误债,我們將使用presto
包來執(zhí)行快差異分析浸船。我們進行了兩項測試 :一項使用基因表達(dá)數(shù)據(jù),另一項使用chromVAR motif可達(dá)性寝蹈。presto根據(jù)Wilcox秩和檢驗(這也是Seurat的默認(rèn)方法)計算一個p值李命,我們將搜索限制為在兩個測試中都返回顯著結(jié)果的TFs。
presto
還計算了“AUC”統(tǒng)計值箫老,它反映了每個基因(或基序)作為細(xì)胞類型標(biāo)記的能力封字。曲線下面積(AUC)的最大值為1表示標(biāo)記良好。由于基因和基序的AUC統(tǒng)計值是相同的耍鬓,因此我們將兩種測試的AUC值取平均值阔籽,并以此來對每種細(xì)胞類型的TFs進行排序:
markers_rna <- presto:::wilcoxauc.Seurat(X = pbmc, group_by = 'celltype', assay = 'data', seurat_assay = 'SCT')
markers_motifs <- presto:::wilcoxauc.Seurat(X = pbmc, group_by = 'celltype', assay = 'data', seurat_assay = 'chromvar')
motif.names <- markers_motifs$feature
colnames(markers_rna) <- paste0("RNA.", colnames(markers_rna))
colnames(markers_motifs) <- paste0("motif.", colnames(markers_motifs))
markers_rna$gene <- markers_rna$RNA.feature
markers_motifs$gene <- ConvertMotifID(pbmc, id = motif.names)
head(markers_motifs)
motif.feature motif.group motif.avgExpr motif.logFC motif.statistic motif.auc motif.pval motif.padj motif.pct_in
1 MA0030.1 CD14 Mono -0.2070450 -0.2837042 8748968 0.4082863 4.331768e-47 5.397656e-47 100
2 MA0031.1 CD14 Mono -0.6256700 -0.8643914 5368273 0.2505201 0.000000e+00 0.000000e+00 100
3 MA0051.1 CD14 Mono 0.1237673 0.1974597 11955125 0.5579074 9.048795e-20 1.045235e-19 100
4 MA0057.1 CD14 Mono 0.7245745 1.0173693 16968911 0.7918847 0.000000e+00 0.000000e+00 100
5 MA0059.1 CD14 Mono 0.9826472 1.3781270 17525266 0.8178480 0.000000e+00 0.000000e+00 100
6 MA0066.1 CD14 Mono 1.6772753 2.3573633 19347910 0.9029049 0.000000e+00 0.000000e+00 100
motif.pct_out gene
1 100 FOXF2
2 100 FOXD1
3 100 IRF2
4 100 MZF1(var.2)
5 100 MAX::MYC
6 100 PPARG
尋找toptf的函數(shù)
# a simple function to implement the procedure above
topTFs <- function(celltype, padj.cutoff = 1e-2) {
ctmarkers_rna <- dplyr::filter(
markers_rna, RNA.group == celltype, RNA.padj < padj.cutoff, RNA.logFC > 0) %>%
arrange(-RNA.auc)
ctmarkers_motif <- dplyr::filter(
markers_motifs, motif.group == celltype, motif.padj < padj.cutoff, motif.logFC > 0) %>%
arrange(-motif.auc)
top_tfs <- inner_join(
x = ctmarkers_rna[, c(2, 11, 6, 7)],
y = ctmarkers_motif[, c(2, 1, 11, 6, 7)], by = "gene"
)
top_tfs$avg_auc <- (top_tfs$RNA.auc + top_tfs$motif.auc) / 2
top_tfs <- arrange(top_tfs, -avg_auc)
return(top_tfs)
}
我們現(xiàn)在可以計算和可視化任何細(xì)胞類型的putative regulators。我們恢復(fù)了成熟的調(diào)控因子牲蜀,包括用于NK細(xì)胞的TBX21(這里的背景知識很重要啊)笆制,漿細(xì)胞的IRF4,造血祖細(xì)胞的SOX4涣达,B細(xì)胞的EBF1和PAX5在辆,pDC的IRF8和TCF4。我們相信度苔,類似的策略可以用來幫助關(guān)注不同系統(tǒng)中一組putative regulators匆篓。
# identify top markers in NK and visualize
head(topTFs("NK"), 3)
head(topTFs("NK"), 3)
RNA.group gene RNA.auc RNA.pval motif.group motif.feature motif.auc motif.pval avg_auc
1 NK TBX21 0.7269867 0.000000e+00 NK MA0690.1 0.9175550 3.489416e-206 0.8222709
2 NK EOMES 0.5898072 8.133395e-101 NK MA0800.1 0.9238892 2.013461e-212 0.7568482
3 NK RUNX3 0.7712620 9.023079e-121 NK MA0684.2 0.6923503 3.069778e-45 0.7318062
motif.name <- ConvertMotifID(pbmc, name = 'TBX21')
gene_plot <- FeaturePlot(pbmc, features = "sct_TBX21", reduction = 'wnn.umap')
motif_plot <- FeaturePlot(pbmc, features = motif.name, min.cutoff = 0, cols = c("lightgrey", "darkred"), reduction = 'wnn.umap')
gene_plot | motif_plot
# identify top markers in pDC and visualize
head(topTFs("pDC"), 3)
head(topTFs("pDC"), 3)
RNA.group gene RNA.auc RNA.pval motif.group motif.feature motif.auc motif.pval avg_auc
1 pDC TCF4 0.9998833 3.347777e-163 pDC MA0830.2 0.9952958 3.912770e-69 0.9975896
2 pDC IRF8 0.9905372 2.063258e-124 pDC MA0652.1 0.8714643 1.143055e-39 0.9310008
3 pDC RUNX2 0.9754498 7.815119e-112 pDC MA0511.2 0.8355034 1.126551e-32 0.9054766
motif.name <- ConvertMotifID(pbmc, name = 'TCF4')
gene_plot <- FeaturePlot(pbmc, features = "sct_TCF4", reduction = 'wnn.umap')
motif_plot <- FeaturePlot(pbmc, features = motif.name, min.cutoff = 0, cols = c("lightgrey", "darkred"), reduction = 'wnn.umap')
gene_plot | motif_plot
# identify top markers in HSPC and visualize
head(topTFs("CD16 Mono"),3)
RNA.group gene RNA.auc RNA.pval motif.group motif.feature motif.auc motif.pval avg_auc
1 CD16 Mono TBX21 0.6550599 8.750032e-193 CD16 Mono MA0690.1 0.8896766 2.483767e-191 0.7723682
2 CD16 Mono EOMES 0.5862998 3.802279e-99 CD16 Mono MA0800.1 0.8922904 7.043330e-194 0.7392951
3 CD16 Mono RUNX3 0.7182799 7.394573e-84 CD16 Mono MA0684.2 0.6753424 3.175696e-40 0.6968111
motif.name <- ConvertMotifID(pbmc, name = 'SPI1')
gene_plot <- FeaturePlot(pbmc, features = "sct_SPI1", reduction = 'wnn.umap')
motif_plot <- FeaturePlot(pbmc, features = motif.name, min.cutoff = 0, cols = c("lightgrey", "darkred"), reduction = 'wnn.umap')
gene_plot | motif_plot
> # identify top markers in other cell types
> head(topTFs("Naive B"), 3)
RNA.group gene RNA.auc RNA.pval motif.group motif.feature motif.auc motif.pval avg_auc
1 Naive B EBF1 0.9604105 0.000000e+00 Naive B MA0154.4 0.7562941 8.038044e-26 0.8583523
2 Naive B POU2F2 0.6338272 2.833732e-09 Naive B MA0507.1 0.9702895 7.975628e-83 0.8020583
3 Naive B TCF4 0.7139073 2.689186e-41 Naive B MA0830.2 0.8641510 2.153784e-50 0.7890291
> head(topTFs("HSPC"), 3)
RNA.group gene RNA.auc RNA.pval motif.group motif.feature motif.auc motif.pval avg_auc
1 HSPC SOX4 0.9869221 7.766427e-71 HSPC MA0867.2 0.6967552 5.188151e-04 0.8418387
2 HSPC GATA2 0.7884615 0.000000e+00 HSPC MA0036.3 0.8240827 1.084319e-08 0.8062721
3 HSPC MEIS1 0.8830582 0.000000e+00 HSPC MA0498.2 0.7048912 3.010760e-04 0.7939747
> head(topTFs("Plasma"), 3)
RNA.group gene RNA.auc RNA.pval motif.group motif.feature motif.auc motif.pval avg_auc
1 Plasma IRF4 0.8189901 5.206829e-35 Plasma MA1419.1 0.9821051 1.452382e-12 0.9005476
2 Plasma MEF2C 0.9070644 4.738760e-12 Plasma MA0497.1 0.7525389 2.088028e-04 0.8298016
3 Plasma TCF4 0.8052937 5.956053e-12 Plasma MA0830.2 0.7694984 7.585013e-05 0.7873960
https://satijalab.org/seurat/v4.0/weighted_nearest_neighbor_analysis.html
https://monkeylearn.com/blog/what-is-tf-idf/
https://www.biorxiv.org/content/10.1101/460147v1.full
http://www.ludowaltman.nl/slm/#:~:text=The%20SLM%20algorithm%20has%20been%20implemented%20in%20the,be%20run%20on%20any%20system%20with%20Java%20support.
https://www.cwts.nl/blog?article=n-r2u2a4
https://github.com/chen198328/slm