Seurat 4.0 || WNN整合scRNA和scATAC數(shù)據(jù)

前情回顧

Seurat 4.0 ||單細(xì)胞數(shù)據(jù)分析工具箱有更新
 Seurat 4.0 ||單細(xì)胞多模態(tài)數(shù)據(jù)整合算法WNN
Seurat 4.0 || 分析scRNA和表面抗體數(shù)據(jù)

單細(xì)胞數(shù)據(jù)數(shù)據(jù)正在走向統(tǒng)一：在一個細(xì)胞內(nèi)同時測量分屬中心法則不同階段的多種數(shù)據(jù)類型昔瞧。這需要我們開發(fā)新的方法來將這些數(shù)據(jù)整合在一起以描繪一個完整的細(xì)胞狀態(tài)典奉。Seurat升級到4.0以后驹吮，在一個Seurat對象中可以存儲（數(shù)據(jù)結(jié)構(gòu)）和計算（算法）單細(xì)胞多模態(tài)數(shù)據(jù)虱饿。本文我們跟著官方教程演示使用WNN分析多模態(tài)技術(shù)分析10xscRNA+ATAC數(shù)據(jù)。使用的數(shù)據(jù)集在10x網(wǎng)站上公開标捺，是為10412個同時測量轉(zhuǎn)錄組和ATAC的PBMCs細(xì)胞套才。

在這個例子中漠趁，我們將演示：

使用成對的轉(zhuǎn)錄組和ATAC-seq數(shù)據(jù)創(chuàng)建多模式Seurat對象
對單細(xì)胞RNA+ATAC數(shù)據(jù)進行加權(quán)近鄰聚類
利用這兩種模式來識別不同細(xì)胞類型和狀態(tài)的調(diào)節(jié)因子

你可以從10x dataset網(wǎng)站下載數(shù)據(jù)集。請務(wù)必下載以下文件：

Filtered feature barcode matrix (HDF5)
ATAC Per fragment information file (TSV.GZ)
ATAC Per fragment information index (TSV.GZ index) # 這個索引文件別忘了哦
網(wǎng)址：https://support.10xgenomics.com/single-cell-multiome-atac-gex/datasets/1.0.0/pbmc_granulocyte_sorted_10k?

最后盾沫，為了運行本示例裁赠，請確保安裝了以下軟件包:

Seurat v4.
Signac for the analysis of single-cell chromatin datasets
EnsDb.Hsapiens.v86 for a set of annotations for hg38
dplyr to help manipulate data tables

library(Seurat)
library(Signac)
library(EnsDb.Hsapiens.v86)
library(dplyr)
library(ggplot2)

我們將基于基因表達(dá)數(shù)據(jù)創(chuàng)建一個Seurat對象，然后添加ATAC-seq數(shù)據(jù)作為第二個assay赴精。關(guān)于創(chuàng)建和處理ATAC對象的更多信息佩捞，可以瀏覽Signac相關(guān)文檔。

# the 10x hdf5 file contains both data types. 
inputdata.10x <- Read10X_h5("../data/pbmc_granulocyte_sorted_10k_filtered_feature_bc_matrix.h5")

# extract RNA and ATAC data
rna_counts <- inputdata.10x$`Gene Expression`
atac_counts <- inputdata.10x$Peaks

# Create Seurat object
pbmc <- CreateSeuratObject(counts = rna_counts)
pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-")

head(pbmc@meta.data)
                      orig.ident nCount_RNA nFeature_RNA percent.mt
AAACAGCCAAGGAATC-1 SeuratProject       8380         3308   7.470167
AAACAGCCAATCCCTT-1 SeuratProject       3771         1896  10.527711
AAACAGCCAATGCGCT-1 SeuratProject       6876         2904   6.457243
AAACAGCCACACTAAT-1 SeuratProject       1733          846  18.003462
AAACAGCCACCAACCG-1 SeuratProject       5415         2282   6.500462
AAACAGCCAGGATAAC-1 SeuratProject       2759         1353   6.922798

VlnPlot(pbmc,features=c('percent.mt','nCount_RNA', 'nFeature_RNA'),pt.size = 0,cols = "red")

添加ATAC數(shù)據(jù)蕾哟。


# Now add in the ATAC-seq data
# we'll only use peaks in standard chromosomes
grange.counts <- StringToGRanges(rownames(atac_counts), sep = c(":", "-"))
grange.use <- seqnames(grange.counts) %in% standardChromosomes(grange.counts)
atac_counts <- atac_counts[as.vector(grange.use), ]
annotations <- GetGRangesFromEnsDb(ensdb = EnsDb.Hsapiens.v86)
seqlevelsStyle(annotations) <- 'UCSC'
genome(annotations) <- "hg38"

frag.file <- "/home/data/vip06/dataset/pbmc_granulocyte_sorted_10k_atac_fragments.tsv.gz"



chrom_assay <- CreateChromatinAssay(
  counts = atac_counts,
  sep = c(":", "-"),
  genome = 'hg38',
  fragments = frag.file,
  min.cells = 10,
  annotation = annotations
)
pbmc[["ATAC"]] <- chrom_assay

查看數(shù)據(jù)集對象：


pbmc[["ATAC"]]
ChromatinAssay data with 105913 features for 11909 cells
Variable features: 0 
Genome: hg38 
Annotation present: TRUE 
Motifs present: FALSE 
Fragment files: 1 


pbmc
An object of class Seurat 
142514 features across 11909 samples within 2 assays 
Active assay: RNA (36601 features, 0 variable features)
 1 other assay present: ATAC

我們根據(jù)每種模式檢測到的分子數(shù)以及線粒體百分比來執(zhí)行基本的QC一忱，其實這一步什么也沒有Q掉，只是獲得數(shù)據(jù)概覽谭确，了解數(shù)據(jù)的質(zhì)量分布帘营。

 head(pbmc@meta.data)
                      orig.ident nCount_RNA nFeature_RNA percent.mt nCount_ATAC nFeature_ATAC
AAACAGCCAAGGAATC-1 SeuratProject       8380         3308   7.470167       55550         13867
AAACAGCCAATCCCTT-1 SeuratProject       3771         1896  10.527711       20485          7247
AAACAGCCAATGCGCT-1 SeuratProject       6876         2904   6.457243       16674          6528
AAACAGCCACACTAAT-1 SeuratProject       1733          846  18.003462        2007           906
AAACAGCCACCAACCG-1 SeuratProject       5415         2282   6.500462        7658          3323
AAACAGCCAGGATAAC-1 SeuratProject       2759         1353   6.922798       10355          4267


VlnPlot(pbmc, features = c("nCount_RNA", "nFeature_RNA", "percent.mt", "nCount_ATAC", "nFeature_ATAC"), ncol = 3,
        log = TRUE, pt.size = 0) + NoLegend()

如果是多個樣本這樣的展示方式可以看到每個樣本的質(zhì)量差異，咦逐哈，WNN框架下如何執(zhí)行多樣本分析仪吧？這里我們還是假設(shè)有那么一組吧，人為設(shè)置一個分組鞠眉。

pbmc$replicate <- sample(c("ctrl", "para"), size = ncol(pbmc), replace = TRUE)
VlnPlot(pbmc, features = c("nCount_RNA", "nFeature_RNA", "percent.mt", "nCount_ATAC", "nFeature_ATAC"), ncol = 3,group.by = 'replicate',
        log = TRUE, pt.size = 0) + NoLegend()

看到數(shù)據(jù)的質(zhì)量分布之后薯鼠，我們可以執(zhí)行真正的質(zhì)控了择诈，把異常值或者不是細(xì)胞的barcode去掉，保證進入分析的都是細(xì)胞出皇，而非雜質(zhì)羞芍。

pbmc <- subset(
  x = pbmc,
  subset = nCount_ATAC < 7e4 &
    nCount_ATAC > 5e3 &
    nCount_RNA < 25000 &
    nCount_RNA > 1000 &
    percent.mt < 20
)

接下來，我們使用RNA和ATAC-seq數(shù)據(jù)的標(biāo)準(zhǔn)方法郊艘，分別對這兩種檢測方法進行預(yù)處理和降維荷科。

# RNA analysis
DefaultAssay(pbmc) <- "RNA"
pbmc <- SCTransform(pbmc, verbose = FALSE) %>% RunPCA() %>% RunUMAP(dims = 1:50, reduction.name = 'umap.rna', reduction.key = 'rnaUMAP_')

PC_ 1 
Positive:  VCAN, PLXDC2, SAT1, SLC8A1, NAMPT, NEAT1, ZEB2, DPYD, LRMDA, LYZ 
       FCN1, LYN, ANXA1, JAK2, CD36, AOAH, ARHGAP26, TYMP, RBM47, PLCB1 
       PID1, LRRK2, ACSL1, LYST, IRAK3, DMXL2, MCTP1, AC020916.1, GAB2, TBXAS1 
Negative:  EEF1A1, RPL13, BCL2, RPS27, RPL13A, LEF1, BCL11B, BACH2, INPP4B, RPL41 
       IL32, RPS27A, CAMK4, IL7R, RPL3, SKAP1, RPS2, GNLY, LTB, CD247 
       RPS12, RPS18, RPL11, PRKCH, ANK3, RPL23A, THEMIS, CCL5, RPL10, RPS3 
PC_ 2 
Positive:  BANK1, IGHM, AFF3, IGKC, RALGPS2, MS4A1, PAX5, CD74, EBF1, FCRL1 
       CD79A, OSBPL10, LINC00926, COL19A1, BLK, NIBAN3, IGHD, CD22, HLA-DRA, ADAM28 
       CD79B, PLEKHG1, COBLL1, AP002075.1, LIX1-AS1, CCSER1, TCF4, BCL11A, GNG7, STEAP1B 
Negative:  GNLY, CCL5, NKG7, CD247, IL32, PRKCH, PRF1, BCL11B, INPP4B, LEF1 
       IL7R, CAMK4, THEMIS, GZMA, RORA, GZMH, TXK, TC2N, TRBC1, PITPNC1 
       SYNE2, PDE3B, STAT4, TRAC, TGFBR3, VCAN, KLRD1, NCALD, CTSW, SLFN12L 
PC_ 3 
Positive:  LEF1, CAMK4, INPP4B, PDE3B, IL7R, FHIT, BACH2, MAML2, TSHZ2, SERINC5 
       BCL2, EEF1A1, NELL2, ANK3, CCR7, PRKCA, TCF7, BCL11B, LTB, AC139720.1 
       OXNAD1, MLLT3, RPL11, AL589693.1, TRABD2A, RPL13, RASGRF2, MAL, NR3C2, LDHB 
Negative:  GNLY, NKG7, CCL5, PRF1, GZMA, GZMH, KLRD1, SPON2, FGFBP2, CST7 
       GZMB, CCL4, TGFBR3, KLRF1, CTSW, BNC2, ADGRG1, PDGFD, PPP2R2B, PYHIN1 
       IL2RB, CLIC3, A2M, NCAM1, MCTP2, TRDC, C1orf21, SAMD3, MYBL1, FCRL6 
PC_ 4 
Positive:  TCF4, CUX2, AC023590.1, LINC01374, RHEX, LINC01478, EPHB1, PLD4, PTPRS, ZFAT 
       LINC00996, CLEC4C, LILRA4, COL26A1, PLXNA4, SCN9A, CCDC50, FAM160A1, ITM2C, COL24A1 
       UGCG, RUNX2, GZMB, NRP1, PHEX, IRF8, JCHAIN, SLC35F3, PACSIN1, AC007381.1 
Negative:  BANK1, IGHM, MS4A1, PAX5, IGKC, FCRL1, EBF1, RALGPS2, CD79A, LINC00926 
       OSBPL10, GNLY, COL19A1, IGHD, NKG7, CD22, BACH2, CCL5, PLEKHG1, CD79B 
       ADAM28, LIX1-AS1, LARGE1, STEAP1B, AP002075.1, AC120193.1, BLK, FCER2, ARHGAP24, IFNG-AS1 
PC_ 5 
Positive:  VCAN, PLCB1, ANXA1, PLXDC2, DPYD, CD36, ARHGAP26, ARHGAP24, ACSL1, AC020916.1 
       LRMDA, DYSF, CSF3R, MEGF9, PDE4D, S100A8, FNDC3B, ADAMTSL4-AS1, S100A9, TMTC2 
       RAB11FIP1, CREB5, LRRK2, NAMPT, RBM47, JUN, SLC2A3, GNLY, GLT1D1, VCAN-AS1 
Negative:  CDKN1C, FCGR3A, TCF7L2, IFITM3, CST3, MTSS1, AIF1, LST1, MS4A7, PSAP 
       WARS, ACTB, HLA-DPA1, SERPINA1, CD74, FCER1G, IFI30, COTL1, CFD, SMIM25 
       HLA-DRA, HLA-DRB1, FMNL2, CSF1R, FTL, MAFB, TMSB4X, TYROBP, HLA-DPB1, CCDC26 
06:10:45 UMAP embedding parameters a = 0.9922 b = 1.112
06:10:45 Read 10412 rows and found 50 numeric columns
06:10:45 Using Annoy for neighbor search, n_neighbors = 30
06:10:45 Building Annoy index with metric = cosine, n_trees = 50
0%   10   20   30   40   50   60   70   80   90   100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
06:10:46 Writing NN index file to temp file /tmp/Rtmp3eAPNF/file3dd9069c738d7
06:10:46 Searching Annoy index using 1 thread, search_k = 3000
06:10:49 Annoy recall = 100%
06:10:50 Commencing smooth kNN distance calibration using 1 thread
06:10:52 Initializing from normalized Laplacian + noise
06:10:53 Commencing optimization for 200 epochs, with 443480 positive edges
0%   10   20   30   40   50   60   70   80   90   100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
06:10:59 Optimization finished
There were 50 or more warnings (use warnings() to see the first 50)

分析ATAC數(shù)據(jù)，ATAC數(shù)據(jù)在這里也是一個矩陣纱注，只是不是RNA含量而是fragments矩陣畏浆。但是ATAC的assays更豐富。

 str(pbmc@assays$ATAC,max.level = 2)
Formal class 'ChromatinAssay' [package "Signac"] with 16 slots
  ..@ ranges            :Formal class 'GRanges' [package "GenomicRanges"] with 7 slots
  ..@ motifs            : NULL
  ..@ fragments         :List of 1
  ..@ seqinfo           :Formal class 'Seqinfo' [package "GenomeInfoDb"] with 4 slots
  ..@ annotation        :Formal class 'GRanges' [package "GenomicRanges"] with 7 slots
  ..@ bias              : NULL
  ..@ positionEnrichment: list()
  ..@ links             :Formal class 'GRanges' [package "GenomicRanges"] with 7 slots
  ..@ counts            :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
  ..@ data              :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
  ..@ scale.data        : num[0 , 0 ] 
  ..@ key               : chr "atac_"
  ..@ assay.orig        : NULL
  ..@ var.features      : chr [1:105913] "chr19-1230013-1281741" "chr19-39381822-39438328" "chr17-7832492-7859042" "chr19-12774552-12797715" ...
  ..@ meta.features     :'data.frame':  105913 obs. of  2 variables:
  ..@ misc              : NULL

如fragments矩陣信息

head(pbmc@assays$ATAC@counts)
6 x 10412 sparse Matrix of class "dgCMatrix"
   [[ suppressing 76 column names 'AAACAGCCAAGGAATC-1', 'AAACAGCCAATCCCTT-1', 'AAACAGCCAATGCGCT-1' ... ]]
                                                                                                                                                                    
chr1-10109-10357   . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
chr1-180730-181630 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
chr1-191491-191736 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
chr1-267816-268196 . . . . . . . . . . . . . . . 2 . . . . . . . . . . 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
chr1-586028-586373 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
chr1-629721-630172 . . . . . . . . . . . . . 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
                               
chr1-10109-10357   . . . ......
chr1-180730-181630 . . . ......
chr1-191491-191736 . . . ......
chr1-267816-268196 . . . ......
chr1-586028-586373 . . . ......
chr1-629721-630172 . . . ......

 .....suppressing 10336 columns in show(); maybe adjust 'options(max.print= *, width = *)'
 ..............................

如注釋信息：

 head(pbmc@assays$ATAC@annotation)
GRanges object with 6 ranges and 5 metadata columns:
                  seqnames        ranges strand |           tx_id   gene_name         gene_id   gene_biotype     type
                     <Rle>     <IRanges>  <Rle> |     <character> <character>     <character>    <character> <factor>
  ENSE00001489430     chrX 276322-276394      + | ENST00000399012      PLCXD1 ENSG00000182378 protein_coding     exon
  ENSE00001536003     chrX 276324-276394      + | ENST00000484611      PLCXD1 ENSG00000182378 protein_coding     exon
  ENSE00002160563     chrX 276353-276394      + | ENST00000430923      PLCXD1 ENSG00000182378 protein_coding     exon
  ENSE00001750899     chrX 281055-281121      + | ENST00000445062      PLCXD1 ENSG00000182378 protein_coding     exon
  ENSE00001489388     chrX 281192-281684      + | ENST00000381657      PLCXD1 ENSG00000182378 protein_coding     exon
  ENSE00001719251     chrX 281194-281256      + | ENST00000429181      PLCXD1 ENSG00000182378 protein_coding     exon
  -------
  seqinfo: 25 sequences from hg38 genome

理解Seurat的數(shù)據(jù)是很重要的狞贱，不然很多分析數(shù)據(jù)可能不知道如何提取調(diào)用刻获，也不知道函數(shù)做了什么。在這之后瞎嬉，我們進行ATAC數(shù)據(jù)的標(biāo)準(zhǔn)分析蝎毡。

# ATAC analysis
# We exclude the first dimension as this is typically correlated with sequencing depth
DefaultAssay(pbmc) <- "ATAC"
pbmc <- RunTFIDF(pbmc)
pbmc <- FindTopFeatures(pbmc, min.cutoff = 'q0')
pbmc <- RunSVD(pbmc)
pbmc <- RunUMAP(pbmc, reduction = 'lsi', dims = 2:50, reduction.name = "umap.atac", reduction.key = "atacUMAP_")

注意，這里的以及所有特征選擇之后的再次降維用到的維度dims,都需要謹(jǐn)慎選擇氧枣，我們知道每個坐標(biāo)值都有一系列特征支持程度（如loading值）,可以反映出該維度代表的生物學(xué)意義沐兵。如這里的dims = 2:50，排除了第一個維度便监，因為作者發(fā)現(xiàn)它通常與排序深度相關(guān)

pbmc@reductions$lsi@feature.loadings[1:4,1:4]
                              LSI_1         LSI_2         LSI_3         LSI_4
chr19-1230013-1281741   0.008759956  0.0022949972 -0.0008663089 -0.0019122743
chr19-39381822-39438328 0.008735519 -0.0007909501 -0.0018568479 -0.0018096328
chr17-7832492-7859042   0.008713235  0.0012451020 -0.0006112821  0.0004232284
chr19-12774552-12797715 0.008700547  0.0015833131 -0.0014307690  0.0013240653

用Signac分析過scATAC的同學(xué)對這里的流程不會感到陌生扎谎，如果是第一次接觸ATAC分析會覺得比較陌生了，如RunTFIDF是什么?這時候我們可以?RunTFIDF來查看具體信息以及參考文獻烧董。

?RunTFIDF


method
Which TF-IDF implementation to use. Choice of:
*   1: The TF-IDF implementation used by Stuart & Butler et al. 2019 (https://doi.org/10.1101/460147). This computes *\log(TF \times IDF)*.
*   2: The TF-IDF implementation used by Cusanovich & Hill et al. 2018 (https://doi.org/10.1016/j.cell.2018.06.052). This computes *TF \times (\log(IDF))*
*   3: The log-TF method used by Andrew Hill (http://andrewjohnhill.com/blog/2019/05/06/dimensionality-reduction-for-scatac-data/). This computes *\log(TF) \times \log(IDF)*.
*   4: The 10x Genomics method (no TF normalization). This computes *IDF*.

我們計算一個WNN圖簿透，表示RNA和ATAC-seq模式的加權(quán)組合。我們將此圖用于UMAP的可視化和聚類解藻。我們在Seurat 4.0 || 分析scRNA和表面抗體數(shù)據(jù)
中查看過FindMultiModalNeighbors之后的數(shù)據(jù)結(jié)構(gòu)老充。

pbmc <- FindMultiModalNeighbors(pbmc, reduction.list = list("pca", "lsi"), dims.list = list(1:50, 2:50))
Calculating cell-specific modality weights
Finding 20 nearest neighbors for each modality.
  |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=05s  
Calculating kernel bandwidths
  |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=01s  
Finding multimodal neighbors
  |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=20s  
  |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=01s  
Constructing multimodal KNN graph
Constructing multimodal SNN graph

 pbmc <- RunUMAP(pbmc, nn.name = "weighted.nn", reduction.name = "wnn.umap", reduction.key = "wnnUMAP_")
06:41:32 UMAP embedding parameters a = 0.9922 b = 1.112
06:41:33 Commencing smooth kNN distance calibration using 1 thread
06:41:35 Initializing from normalized Laplacian + noise
06:41:36 Commencing optimization for 200 epochs, with 308698 positive edges
0%   10   20   30   40   50   60   70   80   90   100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
06:41:42 Optimization finished

pbmc <- FindClusters(pbmc, graph.name = "wsnn", algorithm = 3, verbose = FALSE)

這里聚類用的算法是SLM

algorithm   

Algorithm for modularity optimization 
(1 = original Louvain algorithm; 
2 = Louvain algorithm with multilevel refinement; 
3 = SLM algorithm; 
4 = Leiden algorithm). Leiden requires the leidenalg python.

這些事圖聚類的話語體系，我們知道在構(gòu)建了細(xì)胞間的圖結(jié)構(gòu)之后螟左，要聚類就需要計算細(xì)胞間的相互關(guān)系啡浊，這里是網(wǎng)略數(shù)據(jù)科學(xué)的一個重要領(lǐng)域：社區(qū)發(fā)現(xiàn)（community detection），如這里的SLM （smart local moving）算法是一種在大型網(wǎng)絡(luò)中進行社區(qū)發(fā)現(xiàn)(或聚類)的算法胶背。SLM算法最大化了所謂的模塊化函數(shù)巷嚣。該算法已成功應(yīng)用于具有數(shù)千萬個節(jié)點和數(shù)億條邊的網(wǎng)絡(luò)。算法的細(xì)節(jié)記錄在一篇論文中(Waltman & Van Eck, 2013)钳吟。在igraph中實現(xiàn)SLM可以這樣：

#devtools::install_github("chen198328/slm")

library(igraph)
library(slm)
net <- graph_from_adjacency_matrix(pbmc[["wsnn"]]) 
slm.net <-slm.community(net)

當(dāng)然這對大部分不做算法調(diào)整的同學(xué)來講并沒有什么用廷粒，這里只是說明我們的聚類是在圖空間中進行的。

接下來我們對細(xì)胞群進行注釋。注意坝茎，您還可以使用我們構(gòu)建好的PBMC參考數(shù)據(jù)集來進行注釋涤姊，可以用Seurat（vignette見官網(wǎng)）或自動化web工具：Azimuth。

# perform sub-clustering on cluster 6 to find additional structure
pbmc <- FindSubCluster(pbmc, cluster = 6, graph.name = "wsnn", algorithm = 3)
Idents(pbmc) <- "sub.cluster"

指定某群在聚類這個函數(shù)好用啊

table(pbmc$seurat_clusters)

   0    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16   17   18   19   20   21 
2386 1159  973  606  558  502  498  470  438  438  372  352  329  285  256  198  163  142  137  106   26   18 
> table(pbmc$sub.cluster)

   0    1   10   11   12   13   14   15   16   17   18   19    2   20   21    3    4    5  6_0  6_1    7    8    9 
2386 1159  372  352  329  285  256  198  163  142  137  106  973   26   18  606  558  502  274  224  470  438  438

其他沒變嗤放，cluster被分成了兩組：6_0 ,6_1思喊。

# add annotations
pbmc <- RenameIdents(pbmc, '19' = 'pDC','20' = 'HSPC','15' = 'cDC')
pbmc <- RenameIdents(pbmc, '0' = 'CD14 Mono', '8' ='CD14 Mono', '5' = 'CD16 Mono')
pbmc <- RenameIdents(pbmc, '17' = 'Naive B', '11' = 'Intermediate B', '10' = 'Memory B', '21' = 'Plasma')
pbmc <- RenameIdents(pbmc, '7' = 'NK')
pbmc <- RenameIdents(pbmc, '4' = 'CD4 TCM', '13'= "CD4 TEM", '3' = "CD4 TCM", '16' ="Treg", '1' ="CD4 Naive", '14' = "CD4 Naive")
pbmc <- RenameIdents(pbmc, '2' = 'CD8 Naive', '9'= "CD8 Naive", '12' = 'CD8 TEM_1', '6_0' = 'CD8 TEM_2', '6_1' ='CD8 TEM_2')
pbmc <- RenameIdents(pbmc, '18' = 'MAIT')
pbmc <- RenameIdents(pbmc, '6_2' ='gdT', '6_3' = 'gdT')
pbmc$celltype <- Idents(pbmc)

pbmc
An object of class Seurat 
166710 features across 10412 samples within 3 assays 
Active assay: ATAC (105913 features, 105913 variable features)
 2 other assays present: RNA, SCT
 5 dimensional reductions calculated: pca, umap.rna, lsi, umap.atac, wnn.umap

head(pbmc@meta.data)
                      orig.ident nCount_RNA nFeature_RNA percent.mt nCount_ATAC nFeature_ATAC replicate nCount_SCT nFeature_SCT SCT.weight wsnn_res.0.8
AAACAGCCAAGGAATC-1 SeuratProject       8380         3308   7.470167       55550         13867      para       4685         2693  0.4514450            1
AAACAGCCAATCCCTT-1 SeuratProject       3771         1896  10.527711       20485          7247      para       3781         1895  0.5069799            4
AAACAGCCAATGCGCT-1 SeuratProject       6876         2904   6.457243       16674          6528      para       4686         2769  0.4324867            1
AAACAGCCACCAACCG-1 SeuratProject       5415         2282   6.500462        7658          3323      ctrl       4486         2278  0.4738122            2
AAACAGCCAGGATAAC-1 SeuratProject       2759         1353   6.922798       10355          4267      ctrl       3332         1352  0.5361155            1
AAACAGCCAGTAGGTG-1 SeuratProject       7614         3061   6.895193       39441         11628      ctrl       4711         2764  0.5273038            9
                   seurat_clusters sub.cluster  celltype
AAACAGCCAAGGAATC-1               1           1 CD4 Naive
AAACAGCCAATCCCTT-1               4           4   CD4 TCM
AAACAGCCAATGCGCT-1               1           1 CD4 Naive
AAACAGCCACCAACCG-1               2           2 CD8 Naive
AAACAGCCAGGATAAC-1               1           1 CD4 Naive
AAACAGCCAGTAGGTG-1               9           9 CD8 Naive

基于基因表達(dá)可視化聚類，ATAC-seq次酌，或WNN分析恨课。與前面的分析相比，差異更加微妙(您可以探索權(quán)重岳服，它比我們的CITE-seq示例中更均勻地分割)剂公，但是我們發(fā)現(xiàn)WNN提供了最清晰的細(xì)胞狀態(tài)分離。


p1 <- DimPlot(pbmc, reduction = "umap.rna", group.by = "celltype", label = TRUE, label.size = 2.5, repel = TRUE) + ggtitle("RNA")
p2 <- DimPlot(pbmc, reduction = "umap.atac", group.by = "celltype", label = TRUE, label.size = 2.5, repel = TRUE) + ggtitle("ATAC")
p3 <- DimPlot(pbmc, reduction = "wnn.umap", group.by = "celltype", label = TRUE, label.size = 2.5, repel = TRUE) + ggtitle("WNN")
p1 + p2 + p3 & NoLegend() & theme(plot.title = element_text(hjust = 0.5))

例如吊宋，ATAC-seq數(shù)據(jù)有助于CD4和CD8 T細(xì)胞狀態(tài)的分離纲辽，這是由于多個基因座的存在。在不同的T細(xì)胞亞型之間表現(xiàn)出不同的可及性贫母。例如文兑，我們可以利用Signac可視化工具盒刚，將CD8A位點的“偽體（pseudobulk）”軌跡與基因表達(dá)水平的小提琴圖一起可視化腺劣。

## to make the visualization easier, subset T cell clusters
celltype.names <- levels(pbmc)
tcell.names <- grep("CD4|CD8|Treg", celltype.names,value = TRUE)
tcells <- subset(pbmc, idents = tcell.names)
CoveragePlot(tcells, region = 'CD8A', features = 'CD8A', assay = 'ATAC', expression.assay = 'SCT', peaks = FALSE)

接下來，我們將檢查每個細(xì)胞的可達(dá)區(qū)域因块，以確定富集基序（enriched motifs）橘原。正如Signac motifs vignette中所描述的，有幾種方法可以做到這一點涡上，但我們將使用Greenleaf實驗室的chromVAR包趾断。這將計算已知motifs的每個單元的可訪問性得分，并將這些得分作為Seurat對象中的第三個分析(chromVAR)添加吩愧。

再走下去芋酌，請確保安裝了以下軟件包。

chromVAR for the analysis of motif accessibility in scATAC-seq
presto for fast differential expression analyses.
TFBSTools for TFBS analysis
JASPAR2020 for JASPAR motif models
motifmatchr for motif matching
BSgenome.Hsapiens.UCSC.hg38 for chromVAR

我先library一下看看哪些是安裝過的雁佳。

library(chromVAR)
library(JASPAR2020)
library(TFBSTools)
library(motifmatchr)
library(BSgenome.Hsapiens.UCSC.hg38)

針對沒安裝的脐帝，我們可以這樣來安裝。

remotes::install_github("immunogenomics/presto")
BiocManager::install(c("chromVAR", "TFBSTools", "JASPAR2020", "motifmatchr", "BSgenome.Hsapiens.UCSC.hg38"))

# Scan the DNA sequence of each peak for the presence of each motif, and create a Motif object
DefaultAssay(pbmc) <- "ATAC"
pwm_set <- getMatrixSet(x = JASPAR2020, opts = list(species = 9606, all_versions = FALSE))
motif.matrix <- CreateMotifMatrix(features = granges(pbmc), pwm = pwm_set, genome = 'hg38', use.counts = FALSE)
motif.object <- CreateMotifObject(data = motif.matrix, pwm = pwm_set)
pbmc <- SetAssayData(pbmc, assay = 'ATAC', slot = 'motifs', new.data = motif.object)

# Note that this step can take 30-60 minutes 
pbmc <- RunChromVAR(
  object = pbmc,
  genome = BSgenome.Hsapiens.UCSC.hg38
)

 pbmc
An object of class Seurat 
166710 features across 10412 samples within 3 assays 
Active assay: ATAC (105913 features, 105913 variable features)
 2 other assays present: RNA, SCT
 5 dimensional reductions calculated: pca, umap.rna, lsi, umap.atac, wnn.umap

最后糖权，我們探索多模態(tài)數(shù)據(jù)集來識別每個細(xì)胞狀態(tài)的關(guān)鍵調(diào)節(jié)器(key regulators)堵腹。配對數(shù)據(jù)提供了一個獨特的機會來識別滿足多種標(biāo)準(zhǔn)的轉(zhuǎn)錄因子(TFs)，幫助縮小假定的調(diào)控因子的名單列表星澳。我們的目標(biāo)是在RNA測量中識別在多種細(xì)胞類型中表達(dá)豐富的TFs疚顷，同時在ATAC測量中富集其motifs 的可達(dá)性。

作為陽性對照，CCAAT增強子結(jié)合蛋白(CEBP)家族蛋白腿堤，包括TF CEBPB阀坏，已多次證明在單核細(xì)胞和樹突狀細(xì)胞等髓系細(xì)胞的分化和功能中發(fā)揮重要作用。我們可以看到CEBPB的表達(dá)和編碼CEBPB結(jié)合位點的MA0466.2.4基序的可及性都在單核細(xì)胞中富集释液。

#returns MA0466.2
motif.name <- ConvertMotifID(pbmc, name = 'CEBPB')
gene_plot <- FeaturePlot(pbmc, features = "sct_CEBPB", reduction = 'wnn.umap')
motif_plot <- FeaturePlot(pbmc, features = motif.name, min.cutoff = 0, cols = c("lightgrey", "darkred"), reduction = 'wnn.umap')
gene_plot | motif_plot

我們想要量化這個關(guān)系全释，并搜索所有的細(xì)胞類型來找到類似的例子。為此误债，我們將使用presto包來執(zhí)行快差異分析浸船。我們進行了兩項測試：一項使用基因表達(dá)數(shù)據(jù)，另一項使用chromVAR motif可達(dá)性寝蹈。presto根據(jù)Wilcox秩和檢驗(這也是Seurat的默認(rèn)方法)計算一個p值李命，我們將搜索限制為在兩個測試中都返回顯著結(jié)果的TFs。

presto還計算了“AUC”統(tǒng)計值箫老，它反映了每個基因(或基序)作為細(xì)胞類型標(biāo)記的能力封字。曲線下面積(AUC)的最大值為1表示標(biāo)記良好。由于基因和基序的AUC統(tǒng)計值是相同的耍鬓，因此我們將兩種測試的AUC值取平均值阔籽，并以此來對每種細(xì)胞類型的TFs進行排序:

markers_rna <- presto:::wilcoxauc.Seurat(X = pbmc, group_by = 'celltype', assay = 'data', seurat_assay = 'SCT')
markers_motifs <- presto:::wilcoxauc.Seurat(X = pbmc, group_by = 'celltype', assay = 'data', seurat_assay = 'chromvar')
motif.names <- markers_motifs$feature
colnames(markers_rna) <- paste0("RNA.", colnames(markers_rna))
colnames(markers_motifs) <- paste0("motif.", colnames(markers_motifs))
markers_rna$gene <- markers_rna$RNA.feature
markers_motifs$gene <- ConvertMotifID(pbmc, id = motif.names)

head(markers_motifs)
  motif.feature motif.group motif.avgExpr motif.logFC motif.statistic motif.auc   motif.pval   motif.padj motif.pct_in
1      MA0030.1   CD14 Mono    -0.2070450  -0.2837042         8748968 0.4082863 4.331768e-47 5.397656e-47          100
2      MA0031.1   CD14 Mono    -0.6256700  -0.8643914         5368273 0.2505201 0.000000e+00 0.000000e+00          100
3      MA0051.1   CD14 Mono     0.1237673   0.1974597        11955125 0.5579074 9.048795e-20 1.045235e-19          100
4      MA0057.1   CD14 Mono     0.7245745   1.0173693        16968911 0.7918847 0.000000e+00 0.000000e+00          100
5      MA0059.1   CD14 Mono     0.9826472   1.3781270        17525266 0.8178480 0.000000e+00 0.000000e+00          100
6      MA0066.1   CD14 Mono     1.6772753   2.3573633        19347910 0.9029049 0.000000e+00 0.000000e+00          100
  motif.pct_out        gene
1           100       FOXF2
2           100       FOXD1
3           100        IRF2
4           100 MZF1(var.2)
5           100    MAX::MYC
6           100       PPARG

尋找toptf的函數(shù)

# a simple function to implement the procedure above
topTFs <- function(celltype, padj.cutoff = 1e-2) {
  ctmarkers_rna <- dplyr::filter(
    markers_rna, RNA.group == celltype, RNA.padj < padj.cutoff, RNA.logFC > 0) %>% 
    arrange(-RNA.auc)
  ctmarkers_motif <- dplyr::filter(
    markers_motifs, motif.group == celltype, motif.padj < padj.cutoff, motif.logFC > 0) %>% 
    arrange(-motif.auc)
  top_tfs <- inner_join(
    x = ctmarkers_rna[, c(2, 11, 6, 7)], 
    y = ctmarkers_motif[, c(2, 1, 11, 6, 7)], by = "gene"
  )
  top_tfs$avg_auc <- (top_tfs$RNA.auc + top_tfs$motif.auc) / 2
  top_tfs <- arrange(top_tfs, -avg_auc)
  return(top_tfs)
}

我們現(xiàn)在可以計算和可視化任何細(xì)胞類型的putative regulators。我們恢復(fù)了成熟的調(diào)控因子牲蜀，包括用于NK細(xì)胞的TBX21(這里的背景知識很重要啊)笆制，漿細(xì)胞的IRF4，造血祖細(xì)胞的SOX4涣达，B細(xì)胞的EBF1和PAX5在辆，pDC的IRF8和TCF4。我們相信度苔，類似的策略可以用來幫助關(guān)注不同系統(tǒng)中一組putative regulators匆篓。

# identify top markers in NK and visualize
head(topTFs("NK"), 3)

head(topTFs("NK"), 3)
  RNA.group  gene   RNA.auc      RNA.pval motif.group motif.feature motif.auc    motif.pval   avg_auc
1        NK TBX21 0.7269867  0.000000e+00          NK      MA0690.1 0.9175550 3.489416e-206 0.8222709
2        NK EOMES 0.5898072 8.133395e-101          NK      MA0800.1 0.9238892 2.013461e-212 0.7568482
3        NK RUNX3 0.7712620 9.023079e-121          NK      MA0684.2 0.6923503  3.069778e-45 0.7318062

motif.name <- ConvertMotifID(pbmc, name = 'TBX21')
gene_plot <- FeaturePlot(pbmc, features = "sct_TBX21", reduction = 'wnn.umap')
motif_plot <- FeaturePlot(pbmc, features = motif.name, min.cutoff = 0, cols = c("lightgrey", "darkred"), reduction = 'wnn.umap')
gene_plot | motif_plot

# identify top markers in pDC and visualize
head(topTFs("pDC"), 3)

 head(topTFs("pDC"), 3)
  RNA.group  gene   RNA.auc      RNA.pval motif.group motif.feature motif.auc   motif.pval   avg_auc
1       pDC  TCF4 0.9998833 3.347777e-163         pDC      MA0830.2 0.9952958 3.912770e-69 0.9975896
2       pDC  IRF8 0.9905372 2.063258e-124         pDC      MA0652.1 0.8714643 1.143055e-39 0.9310008
3       pDC RUNX2 0.9754498 7.815119e-112         pDC      MA0511.2 0.8355034 1.126551e-32 0.9054766

motif.name <- ConvertMotifID(pbmc, name = 'TCF4')
gene_plot <- FeaturePlot(pbmc, features = "sct_TCF4", reduction = 'wnn.umap')
motif_plot <- FeaturePlot(pbmc, features = motif.name, min.cutoff = 0, cols = c("lightgrey", "darkred"), reduction = 'wnn.umap')
gene_plot | motif_plot

# identify top markers in HSPC and visualize
head(topTFs("CD16 Mono"),3)

RNA.group  gene   RNA.auc      RNA.pval motif.group motif.feature motif.auc    motif.pval   avg_auc
1 CD16 Mono TBX21 0.6550599 8.750032e-193   CD16 Mono      MA0690.1 0.8896766 2.483767e-191 0.7723682
2 CD16 Mono EOMES 0.5862998  3.802279e-99   CD16 Mono      MA0800.1 0.8922904 7.043330e-194 0.7392951
3 CD16 Mono RUNX3 0.7182799  7.394573e-84   CD16 Mono      MA0684.2 0.6753424  3.175696e-40 0.6968111

motif.name <- ConvertMotifID(pbmc, name = 'SPI1')
gene_plot <- FeaturePlot(pbmc, features = "sct_SPI1", reduction = 'wnn.umap')
motif_plot <- FeaturePlot(pbmc, features = motif.name, min.cutoff = 0, cols = c("lightgrey", "darkred"), reduction = 'wnn.umap')
gene_plot | motif_plot

> # identify top markers in other cell types
> head(topTFs("Naive B"), 3)
  RNA.group   gene   RNA.auc     RNA.pval motif.group motif.feature motif.auc   motif.pval   avg_auc
1   Naive B   EBF1 0.9604105 0.000000e+00     Naive B      MA0154.4 0.7562941 8.038044e-26 0.8583523
2   Naive B POU2F2 0.6338272 2.833732e-09     Naive B      MA0507.1 0.9702895 7.975628e-83 0.8020583
3   Naive B   TCF4 0.7139073 2.689186e-41     Naive B      MA0830.2 0.8641510 2.153784e-50 0.7890291
> head(topTFs("HSPC"), 3)
  RNA.group  gene   RNA.auc     RNA.pval motif.group motif.feature motif.auc   motif.pval   avg_auc
1      HSPC  SOX4 0.9869221 7.766427e-71        HSPC      MA0867.2 0.6967552 5.188151e-04 0.8418387
2      HSPC GATA2 0.7884615 0.000000e+00        HSPC      MA0036.3 0.8240827 1.084319e-08 0.8062721
3      HSPC MEIS1 0.8830582 0.000000e+00        HSPC      MA0498.2 0.7048912 3.010760e-04 0.7939747
> head(topTFs("Plasma"), 3)
  RNA.group  gene   RNA.auc     RNA.pval motif.group motif.feature motif.auc   motif.pval   avg_auc
1    Plasma  IRF4 0.8189901 5.206829e-35      Plasma      MA1419.1 0.9821051 1.452382e-12 0.9005476
2    Plasma MEF2C 0.9070644 4.738760e-12      Plasma      MA0497.1 0.7525389 2.088028e-04 0.8298016
3    Plasma  TCF4 0.8052937 5.956053e-12      Plasma      MA0830.2 0.7694984 7.585013e-05 0.7873960

https://satijalab.org/seurat/v4.0/weighted_nearest_neighbor_analysis.html
https://monkeylearn.com/blog/what-is-tf-idf/
https://www.biorxiv.org/content/10.1101/460147v1.full
http://www.ludowaltman.nl/slm/#:~:text=The%20SLM%20algorithm%20has%20been%20implemented%20in%20the,be%20run%20on%20any%20system%20with%20Java%20support.
https://www.cwts.nl/blog?article=n-r2u2a4
https://github.com/chen198328/slm