Seurat 4.0 || WNN整合scRNA和scATAC數(shù)據(jù)

前情回顧

Seurat 4.0 ||單細(xì)胞數(shù)據(jù)分析工具箱有更新
Seurat 4.0 ||單細(xì)胞多模態(tài)數(shù)據(jù)整合算法WNN
Seurat 4.0 || 分析scRNA和表面抗體數(shù)據(jù)

單細(xì)胞數(shù)據(jù)數(shù)據(jù)正在走向統(tǒng)一:在一個細(xì)胞內(nèi)同時測量分屬中心法則不同階段的多種數(shù)據(jù)類型昔瞧。這需要我們開發(fā)新的方法來將這些數(shù)據(jù)整合在一起以描繪一個完整的細(xì)胞狀態(tài)典奉。Seurat升級到4.0以后驹吮,在一個Seurat對象中可以存儲(數(shù)據(jù)結(jié)構(gòu))和計算(算法)單細(xì)胞多模態(tài)數(shù)據(jù)虱饿。本文我們跟著官方教程演示使用WNN分析多模態(tài)技術(shù)分析10xscRNA+ATAC數(shù)據(jù)。使用的數(shù)據(jù)集在10x網(wǎng)站上公開标捺,是為10412個同時測量轉(zhuǎn)錄組和ATAC的PBMCs細(xì)胞套才。

在這個例子中漠趁,我們將演示:

  • 使用成對的轉(zhuǎn)錄組和ATAC-seq數(shù)據(jù)創(chuàng)建多模式Seurat對象
  • 對單細(xì)胞RNA+ATAC數(shù)據(jù)進行加權(quán)近鄰聚類
  • 利用這兩種模式來識別不同細(xì)胞類型和狀態(tài)的調(diào)節(jié)因子

你可以從10x dataset網(wǎng)站下載數(shù)據(jù)集。請務(wù)必下載以下文件:

Filtered feature barcode matrix (HDF5)
ATAC Per fragment information file (TSV.GZ)
ATAC Per fragment information index (TSV.GZ index) # 這個索引文件別忘了哦
網(wǎng)址:https://support.10xgenomics.com/single-cell-multiome-atac-gex/datasets/1.0.0/pbmc_granulocyte_sorted_10k?

最后盾沫,為了運行本示例裁赠,請確保安裝了以下軟件包:

library(Seurat)
library(Signac)
library(EnsDb.Hsapiens.v86)
library(dplyr)
library(ggplot2)

我們將基于基因表達(dá)數(shù)據(jù)創(chuàng)建一個Seurat對象,然后添加ATAC-seq數(shù)據(jù)作為第二個assay赴精。關(guān)于創(chuàng)建和處理ATAC對象的更多信息佩捞,可以瀏覽Signac相關(guān)文檔。

# the 10x hdf5 file contains both data types. 
inputdata.10x <- Read10X_h5("../data/pbmc_granulocyte_sorted_10k_filtered_feature_bc_matrix.h5")

# extract RNA and ATAC data
rna_counts <- inputdata.10x$`Gene Expression`
atac_counts <- inputdata.10x$Peaks

# Create Seurat object
pbmc <- CreateSeuratObject(counts = rna_counts)
pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-")

head(pbmc@meta.data)
                      orig.ident nCount_RNA nFeature_RNA percent.mt
AAACAGCCAAGGAATC-1 SeuratProject       8380         3308   7.470167
AAACAGCCAATCCCTT-1 SeuratProject       3771         1896  10.527711
AAACAGCCAATGCGCT-1 SeuratProject       6876         2904   6.457243
AAACAGCCACACTAAT-1 SeuratProject       1733          846  18.003462
AAACAGCCACCAACCG-1 SeuratProject       5415         2282   6.500462
AAACAGCCAGGATAAC-1 SeuratProject       2759         1353   6.922798

VlnPlot(pbmc,features=c('percent.mt','nCount_RNA', 'nFeature_RNA'),pt.size = 0,cols = "red")

添加ATAC數(shù)據(jù)蕾哟。


# Now add in the ATAC-seq data
# we'll only use peaks in standard chromosomes
grange.counts <- StringToGRanges(rownames(atac_counts), sep = c(":", "-"))
grange.use <- seqnames(grange.counts) %in% standardChromosomes(grange.counts)
atac_counts <- atac_counts[as.vector(grange.use), ]
annotations <- GetGRangesFromEnsDb(ensdb = EnsDb.Hsapiens.v86)
seqlevelsStyle(annotations) <- 'UCSC'
genome(annotations) <- "hg38"
frag.file <- "/home/data/vip06/dataset/pbmc_granulocyte_sorted_10k_atac_fragments.tsv.gz"



chrom_assay <- CreateChromatinAssay(
  counts = atac_counts,
  sep = c(":", "-"),
  genome = 'hg38',
  fragments = frag.file,
  min.cells = 10,
  annotation = annotations
)
pbmc[["ATAC"]] <- chrom_assay

查看數(shù)據(jù)集對象:


pbmc[["ATAC"]]
ChromatinAssay data with 105913 features for 11909 cells
Variable features: 0 
Genome: hg38 
Annotation present: TRUE 
Motifs present: FALSE 
Fragment files: 1 


pbmc
An object of class Seurat 
142514 features across 11909 samples within 2 assays 
Active assay: RNA (36601 features, 0 variable features)
 1 other assay present: ATAC

我們根據(jù)每種模式檢測到的分子數(shù)以及線粒體百分比來執(zhí)行基本的QC一忱,其實這一步什么也沒有Q掉,只是獲得數(shù)據(jù)概覽谭确,了解數(shù)據(jù)的質(zhì)量分布帘营。

 head(pbmc@meta.data)
                      orig.ident nCount_RNA nFeature_RNA percent.mt nCount_ATAC nFeature_ATAC
AAACAGCCAAGGAATC-1 SeuratProject       8380         3308   7.470167       55550         13867
AAACAGCCAATCCCTT-1 SeuratProject       3771         1896  10.527711       20485          7247
AAACAGCCAATGCGCT-1 SeuratProject       6876         2904   6.457243       16674          6528
AAACAGCCACACTAAT-1 SeuratProject       1733          846  18.003462        2007           906
AAACAGCCACCAACCG-1 SeuratProject       5415         2282   6.500462        7658          3323
AAACAGCCAGGATAAC-1 SeuratProject       2759         1353   6.922798       10355          4267


VlnPlot(pbmc, features = c("nCount_RNA", "nFeature_RNA", "percent.mt", "nCount_ATAC", "nFeature_ATAC"), ncol = 3,
        log = TRUE, pt.size = 0) + NoLegend()


如果是多個樣本這樣的展示方式可以看到每個樣本的質(zhì)量差異,咦逐哈,WNN框架下如何執(zhí)行多樣本分析仪吧?這里我們還是假設(shè)有那么一組吧,人為設(shè)置一個分組鞠眉。

pbmc$replicate <- sample(c("ctrl", "para"), size = ncol(pbmc), replace = TRUE)
VlnPlot(pbmc, features = c("nCount_RNA", "nFeature_RNA", "percent.mt", "nCount_ATAC", "nFeature_ATAC"), ncol = 3,group.by = 'replicate',
        log = TRUE, pt.size = 0) + NoLegend()

看到數(shù)據(jù)的質(zhì)量分布之后薯鼠,我們可以執(zhí)行真正的質(zhì)控了择诈,把異常值或者不是細(xì)胞的barcode去掉,保證進入分析的都是細(xì)胞出皇,而非雜質(zhì)羞芍。

pbmc <- subset(
  x = pbmc,
  subset = nCount_ATAC < 7e4 &
    nCount_ATAC > 5e3 &
    nCount_RNA < 25000 &
    nCount_RNA > 1000 &
    percent.mt < 20
)

接下來,我們使用RNA和ATAC-seq數(shù)據(jù)的標(biāo)準(zhǔn)方法郊艘,分別對這兩種檢測方法進行預(yù)處理和降維荷科。

# RNA analysis
DefaultAssay(pbmc) <- "RNA"
pbmc <- SCTransform(pbmc, verbose = FALSE) %>% RunPCA() %>% RunUMAP(dims = 1:50, reduction.name = 'umap.rna', reduction.key = 'rnaUMAP_')
PC_ 1 
Positive:  VCAN, PLXDC2, SAT1, SLC8A1, NAMPT, NEAT1, ZEB2, DPYD, LRMDA, LYZ 
       FCN1, LYN, ANXA1, JAK2, CD36, AOAH, ARHGAP26, TYMP, RBM47, PLCB1 
       PID1, LRRK2, ACSL1, LYST, IRAK3, DMXL2, MCTP1, AC020916.1, GAB2, TBXAS1 
Negative:  EEF1A1, RPL13, BCL2, RPS27, RPL13A, LEF1, BCL11B, BACH2, INPP4B, RPL41 
       IL32, RPS27A, CAMK4, IL7R, RPL3, SKAP1, RPS2, GNLY, LTB, CD247 
       RPS12, RPS18, RPL11, PRKCH, ANK3, RPL23A, THEMIS, CCL5, RPL10, RPS3 
PC_ 2 
Positive:  BANK1, IGHM, AFF3, IGKC, RALGPS2, MS4A1, PAX5, CD74, EBF1, FCRL1 
       CD79A, OSBPL10, LINC00926, COL19A1, BLK, NIBAN3, IGHD, CD22, HLA-DRA, ADAM28 
       CD79B, PLEKHG1, COBLL1, AP002075.1, LIX1-AS1, CCSER1, TCF4, BCL11A, GNG7, STEAP1B 
Negative:  GNLY, CCL5, NKG7, CD247, IL32, PRKCH, PRF1, BCL11B, INPP4B, LEF1 
       IL7R, CAMK4, THEMIS, GZMA, RORA, GZMH, TXK, TC2N, TRBC1, PITPNC1 
       SYNE2, PDE3B, STAT4, TRAC, TGFBR3, VCAN, KLRD1, NCALD, CTSW, SLFN12L 
PC_ 3 
Positive:  LEF1, CAMK4, INPP4B, PDE3B, IL7R, FHIT, BACH2, MAML2, TSHZ2, SERINC5 
       BCL2, EEF1A1, NELL2, ANK3, CCR7, PRKCA, TCF7, BCL11B, LTB, AC139720.1 
       OXNAD1, MLLT3, RPL11, AL589693.1, TRABD2A, RPL13, RASGRF2, MAL, NR3C2, LDHB 
Negative:  GNLY, NKG7, CCL5, PRF1, GZMA, GZMH, KLRD1, SPON2, FGFBP2, CST7 
       GZMB, CCL4, TGFBR3, KLRF1, CTSW, BNC2, ADGRG1, PDGFD, PPP2R2B, PYHIN1 
       IL2RB, CLIC3, A2M, NCAM1, MCTP2, TRDC, C1orf21, SAMD3, MYBL1, FCRL6 
PC_ 4 
Positive:  TCF4, CUX2, AC023590.1, LINC01374, RHEX, LINC01478, EPHB1, PLD4, PTPRS, ZFAT 
       LINC00996, CLEC4C, LILRA4, COL26A1, PLXNA4, SCN9A, CCDC50, FAM160A1, ITM2C, COL24A1 
       UGCG, RUNX2, GZMB, NRP1, PHEX, IRF8, JCHAIN, SLC35F3, PACSIN1, AC007381.1 
Negative:  BANK1, IGHM, MS4A1, PAX5, IGKC, FCRL1, EBF1, RALGPS2, CD79A, LINC00926 
       OSBPL10, GNLY, COL19A1, IGHD, NKG7, CD22, BACH2, CCL5, PLEKHG1, CD79B 
       ADAM28, LIX1-AS1, LARGE1, STEAP1B, AP002075.1, AC120193.1, BLK, FCER2, ARHGAP24, IFNG-AS1 
PC_ 5 
Positive:  VCAN, PLCB1, ANXA1, PLXDC2, DPYD, CD36, ARHGAP26, ARHGAP24, ACSL1, AC020916.1 
       LRMDA, DYSF, CSF3R, MEGF9, PDE4D, S100A8, FNDC3B, ADAMTSL4-AS1, S100A9, TMTC2 
       RAB11FIP1, CREB5, LRRK2, NAMPT, RBM47, JUN, SLC2A3, GNLY, GLT1D1, VCAN-AS1 
Negative:  CDKN1C, FCGR3A, TCF7L2, IFITM3, CST3, MTSS1, AIF1, LST1, MS4A7, PSAP 
       WARS, ACTB, HLA-DPA1, SERPINA1, CD74, FCER1G, IFI30, COTL1, CFD, SMIM25 
       HLA-DRA, HLA-DRB1, FMNL2, CSF1R, FTL, MAFB, TMSB4X, TYROBP, HLA-DPB1, CCDC26 
06:10:45 UMAP embedding parameters a = 0.9922 b = 1.112
06:10:45 Read 10412 rows and found 50 numeric columns
06:10:45 Using Annoy for neighbor search, n_neighbors = 30
06:10:45 Building Annoy index with metric = cosine, n_trees = 50
0%   10   20   30   40   50   60   70   80   90   100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
06:10:46 Writing NN index file to temp file /tmp/Rtmp3eAPNF/file3dd9069c738d7
06:10:46 Searching Annoy index using 1 thread, search_k = 3000
06:10:49 Annoy recall = 100%
06:10:50 Commencing smooth kNN distance calibration using 1 thread
06:10:52 Initializing from normalized Laplacian + noise
06:10:53 Commencing optimization for 200 epochs, with 443480 positive edges
0%   10   20   30   40   50   60   70   80   90   100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
06:10:59 Optimization finished
There were 50 or more warnings (use warnings() to see the first 50)

分析ATAC數(shù)據(jù),ATAC數(shù)據(jù)在這里也是一個矩陣纱注,只是不是RNA含量而是fragments矩陣畏浆。但是ATAC的assays更豐富。

 str(pbmc@assays$ATAC,max.level = 2)
Formal class 'ChromatinAssay' [package "Signac"] with 16 slots
  ..@ ranges            :Formal class 'GRanges' [package "GenomicRanges"] with 7 slots
  ..@ motifs            : NULL
  ..@ fragments         :List of 1
  ..@ seqinfo           :Formal class 'Seqinfo' [package "GenomeInfoDb"] with 4 slots
  ..@ annotation        :Formal class 'GRanges' [package "GenomicRanges"] with 7 slots
  ..@ bias              : NULL
  ..@ positionEnrichment: list()
  ..@ links             :Formal class 'GRanges' [package "GenomicRanges"] with 7 slots
  ..@ counts            :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
  ..@ data              :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
  ..@ scale.data        : num[0 , 0 ] 
  ..@ key               : chr "atac_"
  ..@ assay.orig        : NULL
  ..@ var.features      : chr [1:105913] "chr19-1230013-1281741" "chr19-39381822-39438328" "chr17-7832492-7859042" "chr19-12774552-12797715" ...
  ..@ meta.features     :'data.frame':  105913 obs. of  2 variables:
  ..@ misc              : NULL

如fragments矩陣信息

head(pbmc@assays$ATAC@counts)
6 x 10412 sparse Matrix of class "dgCMatrix"
   [[ suppressing 76 column names 'AAACAGCCAAGGAATC-1', 'AAACAGCCAATCCCTT-1', 'AAACAGCCAATGCGCT-1' ... ]]
                                                                                                                                                                    
chr1-10109-10357   . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
chr1-180730-181630 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
chr1-191491-191736 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
chr1-267816-268196 . . . . . . . . . . . . . . . 2 . . . . . . . . . . 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
chr1-586028-586373 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
chr1-629721-630172 . . . . . . . . . . . . . 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
                               
chr1-10109-10357   . . . ......
chr1-180730-181630 . . . ......
chr1-191491-191736 . . . ......
chr1-267816-268196 . . . ......
chr1-586028-586373 . . . ......
chr1-629721-630172 . . . ......

 .....suppressing 10336 columns in show(); maybe adjust 'options(max.print= *, width = *)'
 ..............................

如注釋信息:

 head(pbmc@assays$ATAC@annotation)
GRanges object with 6 ranges and 5 metadata columns:
                  seqnames        ranges strand |           tx_id   gene_name         gene_id   gene_biotype     type
                     <Rle>     <IRanges>  <Rle> |     <character> <character>     <character>    <character> <factor>
  ENSE00001489430     chrX 276322-276394      + | ENST00000399012      PLCXD1 ENSG00000182378 protein_coding     exon
  ENSE00001536003     chrX 276324-276394      + | ENST00000484611      PLCXD1 ENSG00000182378 protein_coding     exon
  ENSE00002160563     chrX 276353-276394      + | ENST00000430923      PLCXD1 ENSG00000182378 protein_coding     exon
  ENSE00001750899     chrX 281055-281121      + | ENST00000445062      PLCXD1 ENSG00000182378 protein_coding     exon
  ENSE00001489388     chrX 281192-281684      + | ENST00000381657      PLCXD1 ENSG00000182378 protein_coding     exon
  ENSE00001719251     chrX 281194-281256      + | ENST00000429181      PLCXD1 ENSG00000182378 protein_coding     exon
  -------
  seqinfo: 25 sequences from hg38 genome

理解Seurat的數(shù)據(jù)是很重要的狞贱,不然很多分析數(shù)據(jù)可能不知道如何提取調(diào)用刻获,也不知道函數(shù)做了什么。在這之后瞎嬉,我們進行ATAC數(shù)據(jù)的標(biāo)準(zhǔn)分析蝎毡。

# ATAC analysis
# We exclude the first dimension as this is typically correlated with sequencing depth
DefaultAssay(pbmc) <- "ATAC"
pbmc <- RunTFIDF(pbmc)
pbmc <- FindTopFeatures(pbmc, min.cutoff = 'q0')
pbmc <- RunSVD(pbmc)
pbmc <- RunUMAP(pbmc, reduction = 'lsi', dims = 2:50, reduction.name = "umap.atac", reduction.key = "atacUMAP_")

注意,這里的以及所有特征選擇之后的再次降維用到的維度dims,都需要謹(jǐn)慎選擇氧枣,我們知道每個坐標(biāo)值都有一系列特征支持程度(如loading值),可以反映出該維度代表的生物學(xué)意義沐兵。如這里的dims = 2:50,排除了第一個維度便监,因為作者發(fā)現(xiàn)它通常與排序深度相關(guān)

pbmc@reductions$lsi@feature.loadings[1:4,1:4]
                              LSI_1         LSI_2         LSI_3         LSI_4
chr19-1230013-1281741   0.008759956  0.0022949972 -0.0008663089 -0.0019122743
chr19-39381822-39438328 0.008735519 -0.0007909501 -0.0018568479 -0.0018096328
chr17-7832492-7859042   0.008713235  0.0012451020 -0.0006112821  0.0004232284
chr19-12774552-12797715 0.008700547  0.0015833131 -0.0014307690  0.0013240653

用Signac分析過scATAC的同學(xué)對這里的流程不會感到陌生扎谎,如果是第一次接觸ATAC分析會覺得比較陌生了,如RunTFIDF是什么?這時候我們可以?RunTFIDF來查看具體信息以及參考文獻烧董。

?RunTFIDF


method
Which TF-IDF implementation to use. Choice of:
*   1: The TF-IDF implementation used by Stuart & Butler et al. 2019 (https://doi.org/10.1101/460147). This computes *\log(TF \times IDF)*.
*   2: The TF-IDF implementation used by Cusanovich & Hill et al. 2018 (https://doi.org/10.1016/j.cell.2018.06.052). This computes *TF \times (\log(IDF))*
*   3: The log-TF method used by Andrew Hill (http://andrewjohnhill.com/blog/2019/05/06/dimensionality-reduction-for-scatac-data/). This computes *\log(TF) \times \log(IDF)*.
*   4: The 10x Genomics method (no TF normalization). This computes *IDF*.

我們計算一個WNN圖簿透,表示RNA和ATAC-seq模式的加權(quán)組合。我們將此圖用于UMAP的可視化和聚類解藻。我們在Seurat 4.0 || 分析scRNA和表面抗體數(shù)據(jù)
中查看過FindMultiModalNeighbors之后的數(shù)據(jù)結(jié)構(gòu)老充。

pbmc <- FindMultiModalNeighbors(pbmc, reduction.list = list("pca", "lsi"), dims.list = list(1:50, 2:50))
Calculating cell-specific modality weights
Finding 20 nearest neighbors for each modality.
  |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=05s  
Calculating kernel bandwidths
  |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=01s  
Finding multimodal neighbors
  |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=20s  
  |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=01s  
Constructing multimodal KNN graph
Constructing multimodal SNN graph

 pbmc <- RunUMAP(pbmc, nn.name = "weighted.nn", reduction.name = "wnn.umap", reduction.key = "wnnUMAP_")
06:41:32 UMAP embedding parameters a = 0.9922 b = 1.112
06:41:33 Commencing smooth kNN distance calibration using 1 thread
06:41:35 Initializing from normalized Laplacian + noise
06:41:36 Commencing optimization for 200 epochs, with 308698 positive edges
0%   10   20   30   40   50   60   70   80   90   100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
06:41:42 Optimization finished

pbmc <- FindClusters(pbmc, graph.name = "wsnn", algorithm = 3, verbose = FALSE)

這里聚類用的算法是SLM

algorithm   

Algorithm for modularity optimization 
(1 = original Louvain algorithm; 
2 = Louvain algorithm with multilevel refinement; 
3 = SLM algorithm; 
4 = Leiden algorithm). Leiden requires the leidenalg python.

這些事圖聚類的話語體系,我們知道在構(gòu)建了細(xì)胞間的圖結(jié)構(gòu)之后螟左,要聚類就需要計算細(xì)胞間的相互關(guān)系啡浊,這里是網(wǎng)略數(shù)據(jù)科學(xué)的一個重要領(lǐng)域:社區(qū)發(fā)現(xiàn)(community detection),如這里的SLM (smart local moving)算法是一種在大型網(wǎng)絡(luò)中進行社區(qū)發(fā)現(xiàn)(或聚類)的算法胶背。SLM算法最大化了所謂的模塊化函數(shù)巷嚣。該算法已成功應(yīng)用于具有數(shù)千萬個節(jié)點和數(shù)億條邊的網(wǎng)絡(luò)。算法的細(xì)節(jié)記錄在一篇論文中(Waltman & Van Eck, 2013)钳吟。在igraph中實現(xiàn)SLM可以這樣:

#devtools::install_github("chen198328/slm")

library(igraph)
library(slm)
net <- graph_from_adjacency_matrix(pbmc[["wsnn"]]) 
slm.net <-slm.community(net)

當(dāng)然這對大部分不做算法調(diào)整的同學(xué)來講并沒有什么用廷粒,這里只是說明我們的聚類是在圖空間中進行的。

接下來我們對細(xì)胞群進行注釋。注意坝茎,您還可以使用我們構(gòu)建好的PBMC參考數(shù)據(jù)集來進行注釋涤姊,可以用Seurat(vignette見官網(wǎng))或自動化web工具:Azimuth。

# perform sub-clustering on cluster 6 to find additional structure
pbmc <- FindSubCluster(pbmc, cluster = 6, graph.name = "wsnn", algorithm = 3)
Idents(pbmc) <- "sub.cluster"

指定某群在聚類這個函數(shù)好用啊

table(pbmc$seurat_clusters)

   0    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16   17   18   19   20   21 
2386 1159  973  606  558  502  498  470  438  438  372  352  329  285  256  198  163  142  137  106   26   18 
> table(pbmc$sub.cluster)

   0    1   10   11   12   13   14   15   16   17   18   19    2   20   21    3    4    5  6_0  6_1    7    8    9 
2386 1159  372  352  329  285  256  198  163  142  137  106  973   26   18  606  558  502  274  224  470  438  438 

其他沒變嗤放,cluster被分成了兩組:6_0 ,6_1思喊。

# add annotations
pbmc <- RenameIdents(pbmc, '19' = 'pDC','20' = 'HSPC','15' = 'cDC')
pbmc <- RenameIdents(pbmc, '0' = 'CD14 Mono', '8' ='CD14 Mono', '5' = 'CD16 Mono')
pbmc <- RenameIdents(pbmc, '17' = 'Naive B', '11' = 'Intermediate B', '10' = 'Memory B', '21' = 'Plasma')
pbmc <- RenameIdents(pbmc, '7' = 'NK')
pbmc <- RenameIdents(pbmc, '4' = 'CD4 TCM', '13'= "CD4 TEM", '3' = "CD4 TCM", '16' ="Treg", '1' ="CD4 Naive", '14' = "CD4 Naive")
pbmc <- RenameIdents(pbmc, '2' = 'CD8 Naive', '9'= "CD8 Naive", '12' = 'CD8 TEM_1', '6_0' = 'CD8 TEM_2', '6_1' ='CD8 TEM_2')
pbmc <- RenameIdents(pbmc, '18' = 'MAIT')
pbmc <- RenameIdents(pbmc, '6_2' ='gdT', '6_3' = 'gdT')
pbmc$celltype <- Idents(pbmc)


pbmc
An object of class Seurat 
166710 features across 10412 samples within 3 assays 
Active assay: ATAC (105913 features, 105913 variable features)
 2 other assays present: RNA, SCT
 5 dimensional reductions calculated: pca, umap.rna, lsi, umap.atac, wnn.umap

head(pbmc@meta.data)
                      orig.ident nCount_RNA nFeature_RNA percent.mt nCount_ATAC nFeature_ATAC replicate nCount_SCT nFeature_SCT SCT.weight wsnn_res.0.8
AAACAGCCAAGGAATC-1 SeuratProject       8380         3308   7.470167       55550         13867      para       4685         2693  0.4514450            1
AAACAGCCAATCCCTT-1 SeuratProject       3771         1896  10.527711       20485          7247      para       3781         1895  0.5069799            4
AAACAGCCAATGCGCT-1 SeuratProject       6876         2904   6.457243       16674          6528      para       4686         2769  0.4324867            1
AAACAGCCACCAACCG-1 SeuratProject       5415         2282   6.500462        7658          3323      ctrl       4486         2278  0.4738122            2
AAACAGCCAGGATAAC-1 SeuratProject       2759         1353   6.922798       10355          4267      ctrl       3332         1352  0.5361155            1
AAACAGCCAGTAGGTG-1 SeuratProject       7614         3061   6.895193       39441         11628      ctrl       4711         2764  0.5273038            9
                   seurat_clusters sub.cluster  celltype
AAACAGCCAAGGAATC-1               1           1 CD4 Naive
AAACAGCCAATCCCTT-1               4           4   CD4 TCM
AAACAGCCAATGCGCT-1               1           1 CD4 Naive
AAACAGCCACCAACCG-1               2           2 CD8 Naive
AAACAGCCAGGATAAC-1               1           1 CD4 Naive
AAACAGCCAGTAGGTG-1               9           9 CD8 Naive

基于基因表達(dá)可視化聚類,ATAC-seq次酌,或WNN分析恨课。與前面的分析相比,差異更加微妙(您可以探索權(quán)重岳服,它比我們的CITE-seq示例中更均勻地分割)剂公,但是我們發(fā)現(xiàn)WNN提供了最清晰的細(xì)胞狀態(tài)分離。


p1 <- DimPlot(pbmc, reduction = "umap.rna", group.by = "celltype", label = TRUE, label.size = 2.5, repel = TRUE) + ggtitle("RNA")
p2 <- DimPlot(pbmc, reduction = "umap.atac", group.by = "celltype", label = TRUE, label.size = 2.5, repel = TRUE) + ggtitle("ATAC")
p3 <- DimPlot(pbmc, reduction = "wnn.umap", group.by = "celltype", label = TRUE, label.size = 2.5, repel = TRUE) + ggtitle("WNN")
p1 + p2 + p3 & NoLegend() & theme(plot.title = element_text(hjust = 0.5))

例如吊宋,ATAC-seq數(shù)據(jù)有助于CD4和CD8 T細(xì)胞狀態(tài)的分離纲辽,這是由于多個基因座的存在。在不同的T細(xì)胞亞型之間表現(xiàn)出不同的可及性贫母。例如文兑,我們可以利用Signac可視化工具盒刚,將CD8A位點的“偽體(pseudobulk)”軌跡與基因表達(dá)水平的小提琴圖一起可視化腺劣。

## to make the visualization easier, subset T cell clusters
celltype.names <- levels(pbmc)
tcell.names <- grep("CD4|CD8|Treg", celltype.names,value = TRUE)
tcells <- subset(pbmc, idents = tcell.names)
CoveragePlot(tcells, region = 'CD8A', features = 'CD8A', assay = 'ATAC', expression.assay = 'SCT', peaks = FALSE)

接下來,我們將檢查每個細(xì)胞的可達(dá)區(qū)域因块,以確定富集基序(enriched motifs)橘原。正如Signac motifs vignette中所描述的,有幾種方法可以做到這一點涡上,但我們將使用Greenleaf實驗室的chromVAR包趾断。這將計算已知motifs的每個單元的可訪問性得分,并將這些得分作為Seurat對象中的第三個分析(chromVAR)添加吩愧。

再走下去芋酌,請確保安裝了以下軟件包。

我先library一下看看哪些是安裝過的雁佳。

library(chromVAR)
library(JASPAR2020)
library(TFBSTools)
library(motifmatchr)
library(BSgenome.Hsapiens.UCSC.hg38)

針對沒安裝的脐帝,我們可以這樣來安裝。

remotes::install_github("immunogenomics/presto")
BiocManager::install(c("chromVAR", "TFBSTools", "JASPAR2020", "motifmatchr", "BSgenome.Hsapiens.UCSC.hg38"))  
# Scan the DNA sequence of each peak for the presence of each motif, and create a Motif object
DefaultAssay(pbmc) <- "ATAC"
pwm_set <- getMatrixSet(x = JASPAR2020, opts = list(species = 9606, all_versions = FALSE))
motif.matrix <- CreateMotifMatrix(features = granges(pbmc), pwm = pwm_set, genome = 'hg38', use.counts = FALSE)
motif.object <- CreateMotifObject(data = motif.matrix, pwm = pwm_set)
pbmc <- SetAssayData(pbmc, assay = 'ATAC', slot = 'motifs', new.data = motif.object)

# Note that this step can take 30-60 minutes 
pbmc <- RunChromVAR(
  object = pbmc,
  genome = BSgenome.Hsapiens.UCSC.hg38
)

 pbmc
An object of class Seurat 
166710 features across 10412 samples within 3 assays 
Active assay: ATAC (105913 features, 105913 variable features)
 2 other assays present: RNA, SCT
 5 dimensional reductions calculated: pca, umap.rna, lsi, umap.atac, wnn.umap

最后糖权,我們探索多模態(tài)數(shù)據(jù)集來識別每個細(xì)胞狀態(tài)的關(guān)鍵調(diào)節(jié)器(key regulators)堵腹。配對數(shù)據(jù)提供了一個獨特的機會來識別滿足多種標(biāo)準(zhǔn)的轉(zhuǎn)錄因子(TFs),幫助縮小假定的調(diào)控因子的名單列表星澳。我們的目標(biāo)是在RNA測量中識別在多種細(xì)胞類型中表達(dá)豐富的TFs疚顷,同時在ATAC測量中富集其motifs 的可達(dá)性。

作為陽性對照,CCAAT增強子結(jié)合蛋白(CEBP)家族蛋白腿堤,包括TF CEBPB阀坏,已多次證明在單核細(xì)胞和樹突狀細(xì)胞等髓系細(xì)胞的分化和功能中發(fā)揮重要作用。我們可以看到CEBPB的表達(dá)和編碼CEBPB結(jié)合位點的MA0466.2.4基序的可及性都在單核細(xì)胞中富集释液。

#returns MA0466.2
motif.name <- ConvertMotifID(pbmc, name = 'CEBPB')
gene_plot <- FeaturePlot(pbmc, features = "sct_CEBPB", reduction = 'wnn.umap')
motif_plot <- FeaturePlot(pbmc, features = motif.name, min.cutoff = 0, cols = c("lightgrey", "darkred"), reduction = 'wnn.umap')
gene_plot | motif_plot

我們想要量化這個關(guān)系全释,并搜索所有的細(xì)胞類型來找到類似的例子。為此误债,我們將使用presto包來執(zhí)行快差異分析浸船。我們進行了兩項測試 :一項使用基因表達(dá)數(shù)據(jù),另一項使用chromVAR motif可達(dá)性寝蹈。presto根據(jù)Wilcox秩和檢驗(這也是Seurat的默認(rèn)方法)計算一個p值李命,我們將搜索限制為在兩個測試中都返回顯著結(jié)果的TFs。

presto還計算了“AUC”統(tǒng)計值箫老,它反映了每個基因(或基序)作為細(xì)胞類型標(biāo)記的能力封字。曲線下面積(AUC)的最大值為1表示標(biāo)記良好。由于基因和基序的AUC統(tǒng)計值是相同的耍鬓,因此我們將兩種測試的AUC值取平均值阔籽,并以此來對每種細(xì)胞類型的TFs進行排序:

markers_rna <- presto:::wilcoxauc.Seurat(X = pbmc, group_by = 'celltype', assay = 'data', seurat_assay = 'SCT')
markers_motifs <- presto:::wilcoxauc.Seurat(X = pbmc, group_by = 'celltype', assay = 'data', seurat_assay = 'chromvar')
motif.names <- markers_motifs$feature
colnames(markers_rna) <- paste0("RNA.", colnames(markers_rna))
colnames(markers_motifs) <- paste0("motif.", colnames(markers_motifs))
markers_rna$gene <- markers_rna$RNA.feature
markers_motifs$gene <- ConvertMotifID(pbmc, id = motif.names)
head(markers_motifs)
  motif.feature motif.group motif.avgExpr motif.logFC motif.statistic motif.auc   motif.pval   motif.padj motif.pct_in
1      MA0030.1   CD14 Mono    -0.2070450  -0.2837042         8748968 0.4082863 4.331768e-47 5.397656e-47          100
2      MA0031.1   CD14 Mono    -0.6256700  -0.8643914         5368273 0.2505201 0.000000e+00 0.000000e+00          100
3      MA0051.1   CD14 Mono     0.1237673   0.1974597        11955125 0.5579074 9.048795e-20 1.045235e-19          100
4      MA0057.1   CD14 Mono     0.7245745   1.0173693        16968911 0.7918847 0.000000e+00 0.000000e+00          100
5      MA0059.1   CD14 Mono     0.9826472   1.3781270        17525266 0.8178480 0.000000e+00 0.000000e+00          100
6      MA0066.1   CD14 Mono     1.6772753   2.3573633        19347910 0.9029049 0.000000e+00 0.000000e+00          100
  motif.pct_out        gene
1           100       FOXF2
2           100       FOXD1
3           100        IRF2
4           100 MZF1(var.2)
5           100    MAX::MYC
6           100       PPARG

尋找toptf的函數(shù)

# a simple function to implement the procedure above
topTFs <- function(celltype, padj.cutoff = 1e-2) {
  ctmarkers_rna <- dplyr::filter(
    markers_rna, RNA.group == celltype, RNA.padj < padj.cutoff, RNA.logFC > 0) %>% 
    arrange(-RNA.auc)
  ctmarkers_motif <- dplyr::filter(
    markers_motifs, motif.group == celltype, motif.padj < padj.cutoff, motif.logFC > 0) %>% 
    arrange(-motif.auc)
  top_tfs <- inner_join(
    x = ctmarkers_rna[, c(2, 11, 6, 7)], 
    y = ctmarkers_motif[, c(2, 1, 11, 6, 7)], by = "gene"
  )
  top_tfs$avg_auc <- (top_tfs$RNA.auc + top_tfs$motif.auc) / 2
  top_tfs <- arrange(top_tfs, -avg_auc)
  return(top_tfs)
}

我們現(xiàn)在可以計算和可視化任何細(xì)胞類型的putative regulators。我們恢復(fù)了成熟的調(diào)控因子牲蜀,包括用于NK細(xì)胞的TBX21(這里的背景知識很重要啊)笆制,漿細(xì)胞的IRF4,造血祖細(xì)胞的SOX4涣达,B細(xì)胞的EBF1和PAX5在辆,pDC的IRF8和TCF4。我們相信度苔,類似的策略可以用來幫助關(guān)注不同系統(tǒng)中一組putative regulators匆篓。


# identify top markers in NK and visualize
head(topTFs("NK"), 3)

head(topTFs("NK"), 3)
  RNA.group  gene   RNA.auc      RNA.pval motif.group motif.feature motif.auc    motif.pval   avg_auc
1        NK TBX21 0.7269867  0.000000e+00          NK      MA0690.1 0.9175550 3.489416e-206 0.8222709
2        NK EOMES 0.5898072 8.133395e-101          NK      MA0800.1 0.9238892 2.013461e-212 0.7568482
3        NK RUNX3 0.7712620 9.023079e-121          NK      MA0684.2 0.6923503  3.069778e-45 0.7318062
motif.name <- ConvertMotifID(pbmc, name = 'TBX21')
gene_plot <- FeaturePlot(pbmc, features = "sct_TBX21", reduction = 'wnn.umap')
motif_plot <- FeaturePlot(pbmc, features = motif.name, min.cutoff = 0, cols = c("lightgrey", "darkred"), reduction = 'wnn.umap')
gene_plot | motif_plot
# identify top markers in pDC and visualize
head(topTFs("pDC"), 3)

 head(topTFs("pDC"), 3)
  RNA.group  gene   RNA.auc      RNA.pval motif.group motif.feature motif.auc   motif.pval   avg_auc
1       pDC  TCF4 0.9998833 3.347777e-163         pDC      MA0830.2 0.9952958 3.912770e-69 0.9975896
2       pDC  IRF8 0.9905372 2.063258e-124         pDC      MA0652.1 0.8714643 1.143055e-39 0.9310008
3       pDC RUNX2 0.9754498 7.815119e-112         pDC      MA0511.2 0.8355034 1.126551e-32 0.9054766
motif.name <- ConvertMotifID(pbmc, name = 'TCF4')
gene_plot <- FeaturePlot(pbmc, features = "sct_TCF4", reduction = 'wnn.umap')
motif_plot <- FeaturePlot(pbmc, features = motif.name, min.cutoff = 0, cols = c("lightgrey", "darkred"), reduction = 'wnn.umap')
gene_plot | motif_plot
# identify top markers in HSPC and visualize
head(topTFs("CD16 Mono"),3)

RNA.group  gene   RNA.auc      RNA.pval motif.group motif.feature motif.auc    motif.pval   avg_auc
1 CD16 Mono TBX21 0.6550599 8.750032e-193   CD16 Mono      MA0690.1 0.8896766 2.483767e-191 0.7723682
2 CD16 Mono EOMES 0.5862998  3.802279e-99   CD16 Mono      MA0800.1 0.8922904 7.043330e-194 0.7392951
3 CD16 Mono RUNX3 0.7182799  7.394573e-84   CD16 Mono      MA0684.2 0.6753424  3.175696e-40 0.6968111
motif.name <- ConvertMotifID(pbmc, name = 'SPI1')
gene_plot <- FeaturePlot(pbmc, features = "sct_SPI1", reduction = 'wnn.umap')
motif_plot <- FeaturePlot(pbmc, features = motif.name, min.cutoff = 0, cols = c("lightgrey", "darkred"), reduction = 'wnn.umap')
gene_plot | motif_plot
> # identify top markers in other cell types
> head(topTFs("Naive B"), 3)
  RNA.group   gene   RNA.auc     RNA.pval motif.group motif.feature motif.auc   motif.pval   avg_auc
1   Naive B   EBF1 0.9604105 0.000000e+00     Naive B      MA0154.4 0.7562941 8.038044e-26 0.8583523
2   Naive B POU2F2 0.6338272 2.833732e-09     Naive B      MA0507.1 0.9702895 7.975628e-83 0.8020583
3   Naive B   TCF4 0.7139073 2.689186e-41     Naive B      MA0830.2 0.8641510 2.153784e-50 0.7890291
> head(topTFs("HSPC"), 3)
  RNA.group  gene   RNA.auc     RNA.pval motif.group motif.feature motif.auc   motif.pval   avg_auc
1      HSPC  SOX4 0.9869221 7.766427e-71        HSPC      MA0867.2 0.6967552 5.188151e-04 0.8418387
2      HSPC GATA2 0.7884615 0.000000e+00        HSPC      MA0036.3 0.8240827 1.084319e-08 0.8062721
3      HSPC MEIS1 0.8830582 0.000000e+00        HSPC      MA0498.2 0.7048912 3.010760e-04 0.7939747
> head(topTFs("Plasma"), 3)
  RNA.group  gene   RNA.auc     RNA.pval motif.group motif.feature motif.auc   motif.pval   avg_auc
1    Plasma  IRF4 0.8189901 5.206829e-35      Plasma      MA1419.1 0.9821051 1.452382e-12 0.9005476
2    Plasma MEF2C 0.9070644 4.738760e-12      Plasma      MA0497.1 0.7525389 2.088028e-04 0.8298016
3    Plasma  TCF4 0.8052937 5.956053e-12      Plasma      MA0830.2 0.7694984 7.585013e-05 0.7873960

https://satijalab.org/seurat/v4.0/weighted_nearest_neighbor_analysis.html
https://monkeylearn.com/blog/what-is-tf-idf/
https://www.biorxiv.org/content/10.1101/460147v1.full
http://www.ludowaltman.nl/slm/#:~:text=The%20SLM%20algorithm%20has%20been%20implemented%20in%20the,be%20run%20on%20any%20system%20with%20Java%20support.
https://www.cwts.nl/blog?article=n-r2u2a4
https://github.com/chen198328/slm

最后編輯于
?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請聯(lián)系作者
  • 序言:七十年代末,一起剝皮案震驚了整個濱河市寇窑,隨后出現(xiàn)的幾起案子鸦概,更是在濱河造成了極大的恐慌,老刑警劉巖甩骏,帶你破解...
    沈念sama閱讀 216,324評論 6 498
  • 序言:濱河連續(xù)發(fā)生了三起死亡事件窗市,死亡現(xiàn)場離奇詭異,居然都是意外死亡横漏,警方通過查閱死者的電腦和手機谨设,發(fā)現(xiàn)死者居然都...
    沈念sama閱讀 92,356評論 3 392
  • 文/潘曉璐 我一進店門,熙熙樓的掌柜王于貴愁眉苦臉地迎上來缎浇,“玉大人扎拣,你說我怎么就攤上這事。” “怎么了二蓝?”我有些...
    開封第一講書人閱讀 162,328評論 0 353
  • 文/不壞的土叔 我叫張陵誉券,是天一觀的道長。 經(jīng)常有香客問我刊愚,道長踊跟,這世上最難降的妖魔是什么? 我笑而不...
    開封第一講書人閱讀 58,147評論 1 292
  • 正文 為了忘掉前任鸥诽,我火速辦了婚禮商玫,結(jié)果婚禮上,老公的妹妹穿的比我還像新娘牡借。我一直安慰自己拳昌,他們只是感情好,可當(dāng)我...
    茶點故事閱讀 67,160評論 6 388
  • 文/花漫 我一把揭開白布钠龙。 她就那樣靜靜地躺著炬藤,像睡著了一般。 火紅的嫁衣襯著肌膚如雪碴里。 梳的紋絲不亂的頭發(fā)上沈矿,一...
    開封第一講書人閱讀 51,115評論 1 296
  • 那天,我揣著相機與錄音咬腋,去河邊找鬼羹膳。 笑死,一個胖子當(dāng)著我的面吹牛帝火,可吹牛的內(nèi)容都是我干的溜徙。 我是一名探鬼主播湃缎,決...
    沈念sama閱讀 40,025評論 3 417
  • 文/蒼蘭香墨 我猛地睜開眼犀填,長吁一口氣:“原來是場噩夢啊……” “哼!你這毒婦竟也來了嗓违?” 一聲冷哼從身側(cè)響起九巡,我...
    開封第一講書人閱讀 38,867評論 0 274
  • 序言:老撾萬榮一對情侶失蹤,失蹤者是張志新(化名)和其女友劉穎蹂季,沒想到半個月后冕广,有當(dāng)?shù)厝嗽跇淞掷锇l(fā)現(xiàn)了一具尸體,經(jīng)...
    沈念sama閱讀 45,307評論 1 310
  • 正文 獨居荒郊野嶺守林人離奇死亡偿洁,尸身上長有42處帶血的膿包…… 初始之章·張勛 以下內(nèi)容為張勛視角 年9月15日...
    茶點故事閱讀 37,528評論 2 332
  • 正文 我和宋清朗相戀三年撒汉,在試婚紗的時候發(fā)現(xiàn)自己被綠了。 大學(xué)時的朋友給我發(fā)了我未婚夫和他白月光在一起吃飯的照片涕滋。...
    茶點故事閱讀 39,688評論 1 348
  • 序言:一個原本活蹦亂跳的男人離奇死亡睬辐,死狀恐怖,靈堂內(nèi)的尸體忽然破棺而出,到底是詐尸還是另有隱情溯饵,我是刑警寧澤侵俗,帶...
    沈念sama閱讀 35,409評論 5 343
  • 正文 年R本政府宣布,位于F島的核電站丰刊,受9級特大地震影響隘谣,放射性物質(zhì)發(fā)生泄漏。R本人自食惡果不足惜啄巧,卻給世界環(huán)境...
    茶點故事閱讀 41,001評論 3 325
  • 文/蒙蒙 一寻歧、第九天 我趴在偏房一處隱蔽的房頂上張望。 院中可真熱鬧秩仆,春花似錦熄求、人聲如沸。這莊子的主人今日做“春日...
    開封第一講書人閱讀 31,657評論 0 22
  • 文/蒼蘭香墨 我抬頭看了看天上的太陽。三九已至逾苫,卻和暖如春卿城,著一層夾襖步出監(jiān)牢的瞬間,已是汗流浹背铅搓。 一陣腳步聲響...
    開封第一講書人閱讀 32,811評論 1 268
  • 我被黑心中介騙來泰國打工瑟押, 沒想到剛下飛機就差點兒被人妖公主榨干…… 1. 我叫王不留,地道東北人星掰。 一個月前我還...
    沈念sama閱讀 47,685評論 2 368
  • 正文 我出身青樓多望,卻偏偏與公主長得像,于是被迫代替她去往敵國和親氢烘。 傳聞我的和親對象是個殘疾皇子怀偷,可洞房花燭夜當(dāng)晚...
    茶點故事閱讀 44,573評論 2 353