使用SnapATAC分析單細(xì)胞ATAC-seq數(shù)據(jù)（三）：Integrative Analysis of PBMC scATAC-seq and scRNA-seq

image.png

簡(jiǎn)介

在本教程中，我們將對(duì)PBMC的兩個(gè)scATAC-seq數(shù)據(jù)集(5K和10K)和一個(gè)scRNA-seq數(shù)據(jù)集進(jìn)行整合分析吮廉。這三個(gè)數(shù)據(jù)集均來(lái)自10X genomics測(cè)序平臺(tái)產(chǎn)生的數(shù)據(jù)，可以直接在10x官網(wǎng)下載使用仔燕。
具體來(lái)說(shuō)，我們將主要執(zhí)行以下分析內(nèi)容：

Cell selection for PBMC 5k and 10k scATAC;
Randomly sample 10,000 cells as landmarks;
Unsupervised clustering of landmarks;
Project the remaining (query) cells onto the landmarks;
Supervised annotation of scATAC clusters using scRNA dataset;
Downstream analysis including peak calling, differential analysis, prediction of gene-enhancer pairing.

分析步驟

Step 0. Data download
Step 1. Barcode selection
Step 2. Add cell-by-bin matrix
Step 3. Matrix binarization
Step 4. Bin filtering
Step 5. Dimensionality reduction of landmarks
Step 6. Determine significant components
Step 7. Graph-based clustering
Step 8. Visualization
Step 9. scRNA-seq based annotation
Step 10. Create psudo multiomics cells
Step 11. Remove cells of low prediction score
Step 12. Gene expression projected into UMAP
Step 13. Identify peak
Step 14. Create a cell-by-peak matrix
Step 15. Identify differentially accessible regions
Step 16. Motif variability analysis
Step 17. De novo motif discovery
Step 18. Predict gene-enhancer pairs

Step 0. Data Download

開(kāi)始分析之前竿裂，我們需要下載snaptools生成的snap文件和cellranger-atac產(chǎn)生的singlecell.csv文件矩父。

# 下載所需的數(shù)據(jù)集和基因注釋信息
$ wget http://renlab.sdsc.edu/r3fang//share/github/PBMC_ATAC_RNA/atac_pbmc_5k_nextgem.snap
$ wget http://cf.10xgenomics.com/samples/cell-atac/1.1.0/atac_pbmc_5k_nextgem/atac_pbmc_5k_nextgem_singlecell.csv
$ wget http://renlab.sdsc.edu/r3fang//share/github/PBMC_ATAC_RNA/atac_pbmc_10k_nextgem.snap
$ wget http://cf.10xgenomics.com/samples/cell-atac/1.1.0/atac_pbmc_10k_nextgem/atac_pbmc_10k_nextgem_singlecell.csv
$ wget http://cf.10xgenomics.com/samples/cell-atac/1.1.0/atac_pbmc_10k_nextgem/hg19.blacklist.bed.gz
$ wget http://cf.10xgenomics.com/samples/cell-atac/1.1.0/gencode.v19.annotation.gene.bed

Step 1. Barcode selection

首先没龙，我們根據(jù)以下兩個(gè)主要的標(biāo)準(zhǔn)來(lái)選擇高質(zhì)量的barcodes：

number of unique fragments;
fragments in promoter ratio;

# 加載所需R包
> library(SnapATAC);
> snap.files = c(
    "atac_pbmc_5k_nextgem.snap", 
    "atac_pbmc_10k_nextgem.snap"
  );
> sample.names = c(
    "PBMC 5K",
    "PBMC 10K"
  );
> barcode.files = c(
    "atac_pbmc_5k_nextgem_singlecell.csv",
    "atac_pbmc_10k_nextgem_singlecell.csv"
  );

# 讀取snap文件
> x.sp.ls = lapply(seq(snap.files), function(i){
    createSnap(
        file=snap.files[i],
        sample=sample.names[i]
    );
  })
> names(x.sp.ls) = sample.names;

# 讀取barcode信息
> barcode.ls = lapply(seq(snap.files), function(i){
    barcodes = read.csv(
        barcode.files[i], 
        head=TRUE
    );
    # remove NO BAROCDE line
    barcodes = barcodes[2:nrow(barcodes),];
    barcodes$logUMI = log10(barcodes$passed_filters + 1);
    barcodes$promoter_ratio = (barcodes$promoter_region_fragments+1) / (barcodes$passed_filters + 1);
    barcodes
  })

# 質(zhì)控指標(biāo)數(shù)據(jù)可視化
> plots = lapply(seq(snap.files), function(i){
    p1 = ggplot(
        barcode.ls[[i]], 
        aes(x=logUMI, y=promoter_ratio)) + 
        geom_point(size=0.3, col="grey") +
        theme_classic() +
        ggtitle(sample.names[[i]]) +
        ylim(0, 1) + xlim(0, 6) + 
        labs(x = "log10(UMI)", y="promoter ratio")
        p1
    })
> plots

image.png

# 查看snap對(duì)象信息
> x.sp.ls
## $`PBMC 5K`
## number of barcodes: 20000
## number of bins: 0
## number of genes: 0
## number of peaks: 0
## number of motifs: 0
## 
## $`PBMC 10K`
## number of barcodes: 20000
## number of bins: 0
## number of genes: 0
## number of peaks: 0
## number of motifs: 0

# for both datasets, we identify usable barcodes using [3.5-5] for log10(UMI) and [0.4-0.8] for promoter ratio as cutoff.
# 設(shè)定質(zhì)控閾值進(jìn)行篩選過(guò)濾
> cutoff.logUMI.low = c(3.5, 3.5);
> cutoff.logUMI.high = c(5, 5);
> cutoff.FRIP.low = c(0.4, 0.4);
> cutoff.FRIP.high = c(0.8, 0.8);

# 根據(jù)過(guò)濾的閾值進(jìn)行barcode的篩選
> barcode.ls = lapply(seq(snap.files), function(i){
    barcodes = barcode.ls[[i]];
    idx = which(
        barcodes$logUMI >= cutoff.logUMI.low[i] & 
        barcodes$logUMI <= cutoff.logUMI.high[i] & 
        barcodes$promoter_ratio >= cutoff.FRIP.low[i] &
        barcodes$promoter_ratio <= cutoff.FRIP.high[i]
    );
    barcodes[idx,]
  });

> x.sp.ls = lapply(seq(snap.files), function(i){
    barcodes = barcode.ls[[i]];
    x.sp = x.sp.ls[[i]];
    barcode.shared = intersect(x.sp@barcode, barcodes$barcode);
    x.sp = x.sp[match(barcode.shared, x.sp@barcode),];
    barcodes = barcodes[match(barcode.shared, barcodes$barcode),];
    x.sp@metaData = barcodes;
    x.sp
  })

> names(x.sp.ls) = sample.names;
> x.sp.ls
## $`PBMC 5K`
## number of barcodes: 4526
## number of bins: 0
## number of genes: 0
## number of peaks: 0
## number of motifs: 0
## 
## $`PBMC 10K`
## number of barcodes: 9039
## number of bins: 0
## number of genes: 0
## number of peaks: 0
## number of motifs: 0
# combine two snap object

# combine two snap object
# 使用Reduce函數(shù)將兩個(gè)snap對(duì)象進(jìn)行合并
> x.sp = Reduce(snapRbind, x.sp.ls);
> x.sp@metaData["sample"] = x.sp@sample;

> x.sp
## number of barcodes: 13565
## number of bins: 0
## number of genes: 0
## number of peaks: 0

> table(x.sp@sample);
## PBMC 10K  PBMC 5K
##     9039     4526

Step 2. Add cell-by-bin matrix

接下來(lái)奢赂，我們使用addBmatToSnap函數(shù)生成5kb分辨率的cell-by-bin計(jì)數(shù)矩陣陪白，并將其添加到snap對(duì)象中。該函數(shù)將自動(dòng)從兩個(gè)snap文件中讀取cell-by-bin矩陣膳灶，并將連接后的矩陣添加到snap對(duì)象的bmat屬性中咱士。

# 使用addBmatToSnap函數(shù)計(jì)算cell-by-bin計(jì)數(shù)矩陣
> x.sp = addBmatToSnap(x.sp, bin.size=5000);

Step 3. Matrix binarization

我們使用makeBinary函數(shù)將cell-by-bin計(jì)數(shù)矩陣轉(zhuǎn)換為二進(jìn)制矩陣。在count矩陣中轧钓，某些項(xiàng)具有異常高的覆蓋率司致，這可能是由比對(duì)錯(cuò)誤造成的。因此聋迎，我們將刪除計(jì)數(shù)矩陣中top 0.1%的項(xiàng)，并將其余非零的值轉(zhuǎn)換為1枣耀。

# 使用makeBinary函數(shù)將cell-by-bin計(jì)數(shù)矩陣轉(zhuǎn)換成二進(jìn)制矩陣
> x.sp = makeBinary(x.sp, mat="bmat");

Step 4. Bin filtering

首先霉晕，我們過(guò)濾掉任何與ENCODE中blacklist區(qū)域重疊的bins庭再，以防止?jié)撛诘腶rtifacts。

> library(GenomicRanges);
# 讀取blacklist信息
> black_list = read.table("hg19.blacklist.bed.gz");
> black_list.gr = GRanges(
    black_list[,1], 
    IRanges(black_list[,2], black_list[,3])
  );

> idy = queryHits(
    findOverlaps(x.sp@feature, black_list.gr)
  );
> if(length(idy) > 0){
    x.sp = x.sp[,-idy, mat="bmat"];
  };
> x.sp
## number of barcodes: 13565
## number of bins: 624794
## number of genes: 0
## number of peaks: 0
## number of motifs: 0

接下來(lái)牺堰，我們移除那些不需要的染色體信息拄轻。

> chr.exclude = seqlevels(x.sp@feature)[grep("random|chrM", seqlevels(x.sp@feature))];

> idy = grep(paste(chr.exclude, collapse="|"), x.sp@feature);
> if(length(idy) > 0){
    x.sp = x.sp[,-idy, mat="bmat"]
  };
> x.sp
## number of barcodes: 13565
## number of bins: 624297
## number of genes: 0
## number of peaks: 0
## number of motifs: 0

第三，bins的覆蓋率大致是服從對(duì)數(shù)正態(tài)分布的伟葫。我們將與不變特征(如管家基因的啟動(dòng)子)重疊的前5%的bins進(jìn)行刪除恨搓。

> bin.cov = log10(Matrix::colSums(x.sp@bmat)+1);
> hist(
    bin.cov[bin.cov > 0], 
    xlab="log10(bin cov)", 
    main="log10(Bin Cov)", 
    col="lightblue", 
    xlim=c(0, 5)
  );
> bin.cutoff = quantile(bin.cov[bin.cov > 0], 0.95);
> idy = which(bin.cov <= bin.cutoff & bin.cov > 0);
> x.sp = x.sp[, idy, mat="bmat"];
> x.sp
## number of barcodes: 13565
## number of bins: 534985
## number of genes: 0
## number of peaks: 0
## number of motifs: 0

image.png

最后，我們將進(jìn)一步刪除bin覆蓋率小于1000的任何細(xì)胞筏养，這是因?yàn)楸M管有些細(xì)胞可能含有較高的unique fragments片段斧抱，但是過(guò)濾后的bin覆蓋率卻很低。此步驟是可選的渐溶，但強(qiáng)烈建議這樣做辉浦。

> idx = which(Matrix::rowSums(x.sp@bmat) > 1000);
> x.sp = x.sp[idx,];
> x.sp
## number of barcodes: 13434
## number of bins: 534985
## number of genes: 0
## number of peaks: 0
## number of motifs: 0

Step 5. Dimentionality reduction

SnapATAC采用diffusion maps算法進(jìn)行數(shù)據(jù)降維，這是一種非線性降維的技術(shù)茎辐，它通過(guò)對(duì)數(shù)據(jù)執(zhí)行random walk來(lái)發(fā)現(xiàn)低維的manifold宪郊，并且對(duì)噪音和擾動(dòng)具有很強(qiáng)的魯棒性。

但是拖陆，diffusion maps算法的計(jì)算成本會(huì)隨著細(xì)胞數(shù)目的增加而呈現(xiàn)指數(shù)級(jí)增長(zhǎng)的趨勢(shì)弛槐。為了克服這一局限，我們將Nystr?m方法(a sampling technique)和diffusion maps算法相結(jié)合依啰，給出Nystr?m Landmark diffusion map來(lái)生成大規(guī)模數(shù)據(jù)集的低維嵌入乎串。

Nystr?m landmark diffusion maps算法主要包括以下三個(gè)步驟：

sampling: sample a subset of K (K?N) cells from N total cells as “l(fā)andmarks”. Instead of random sampling, here we adopted a density-based sampling approach developed in SCTransform to preserve the density distribution of the N original points;
embedding: compute a diffusion map embedding for K landmarks;
extension: project the remaining N-K cells onto the low-dimensional embedding as learned from the landmarks to create a joint embedding space for all cells.

在本示例中，我們將采樣10,000個(gè)細(xì)胞作為landmarks孔飒，并將余下的query細(xì)胞投射到嵌入landmarks的diffusion maps上灌闺。

> row.covs.dens <- density(
    x = x.sp@metaData[,"logUMI"], 
    bw = 'nrd', adjust = 1
  );
> sampling_prob <- 1 / (approx(x = row.covs.dens$x, y = row.covs.dens$y, xout = x.sp@metaData[,"logUMI"])$y + .Machine$double.eps);

> set.seed(1);
> idx.landmark.ds <- sort(sample(x = seq(nrow(x.sp)), size = 10000, prob = sampling_prob));

將x.sp對(duì)象分割為landmark(x.landmark.sp)細(xì)胞和query(x.query.sp)細(xì)胞。

> x.landmark.sp = x.sp[idx.landmark.ds,];
> x.query.sp = x.sp[-idx.landmark.ds,];

Run diffusion maps on the landmark cells.
使用landmark細(xì)胞進(jìn)行diffusion maps降維處理

> x.landmark.sp = runDiffusionMaps(
    obj= x.landmark.sp,
    input.mat="bmat", 
    num.eigs=50
  );
> x.landmark.sp@metaData$landmark = 1;

Porject query cells to landmarks.
將query細(xì)胞映射到landmarks上

> x.query.sp = runDiffusionMapsExtension(
    obj1=x.landmark.sp, 
    obj2=x.query.sp,
    input.mat="bmat"
  );
> x.query.sp@metaData$landmark = 0;

Combine landmark and query cells.
合并landmark和query細(xì)胞
Note: To merge snap objects, all the matrix (bmat, gmat, pmat) and metaData must be of the same number of columns between snap objects.

> x.sp = snapRbind(x.landmark.sp, x.query.sp);
> x.sp = x.sp[order(x.sp@metaData[,"sample"])]; #IMPORTANT

Step 6. Determine significant components

接下來(lái)坏瞄，我們將確定特征向量（eigen-vectors）的數(shù)目桂对，以便用于后續(xù)的分析。在下面的例子中鸠匀，我們選擇前15個(gè)特征向量蕉斜。

> plotDimReductPW(
    obj=x.sp, 
    eigs.dims=1:50,
    point.size=0.3,
    point.color="grey",
    point.shape=19,
    point.alpha=0.6,
    down.sample=5000,
    pdf.file.name=NULL, 
    pdf.height=7, 
    pdf.width=7
  );

image.png

Step 7. Graph-based clustering

接下來(lái)，我們使用所選的特征向量成分缀棍，來(lái)構(gòu)造K近鄰（KNN）聚類圖宅此。其中，每個(gè)節(jié)點(diǎn)是一個(gè)細(xì)胞爬范，根據(jù)歐氏距離來(lái)確定每個(gè)細(xì)胞的k個(gè)近鄰細(xì)胞父腕。

> x.sp = runKNN(
    obj=x.sp,
    eigs.dims=1:20,
    k=15
  );

> library(leiden);
> x.sp=runCluster(
    obj=x.sp,
    tmp.folder=tempdir(),
    louvain.lib="leiden",
    seed.use=10,
    resolution=0.7
  );

Step 8. Visualization

SnapATAC可以使用tSNE(FI-tsne)和UMAP方法對(duì)降維聚類后的細(xì)胞進(jìn)行可視化的展示和探索。在本示例中青瀑，我們使用UMAP方法進(jìn)行展示璧亮。

> library(umap);
> x.sp = runViz(
    obj=x.sp, 
    tmp.folder=tempdir(),
    dims=2,
    eigs.dims=1:20, 
    method="umap",
    seed.use=10
  );

> par(mfrow = c(2, 2));
> plotViz(
    obj= x.sp,
    method="umap", 
    main="Cluster",
    point.color=x.sp@cluster, 
    point.size=0.2, 
    point.shape=19, 
    text.add=TRUE,
    text.size=1,
    text.color="black",
    down.sample=10000,
    legend.add=FALSE
  );
> plotFeatureSingle(
    obj=x.sp,
    feature.value=x.sp@metaData[,"logUMI"],
    method="umap", 
    main="Read Depth",
    point.size=0.2, 
    point.shape=19, 
    down.sample=10000,
    quantiles=c(0.01, 0.99)
  );
> plotViz(
    obj= x.sp,
    method="umap", 
    main="Sample",
    point.size=0.2, 
    point.shape=19, 
    point.color=x.sp@sample, 
    text.add=FALSE,
    text.size=1.5,
    text.color="black",
    down.sample=10000,
    legend.add=TRUE
    );
> plotViz(
    obj= x.sp,
    method="umap", 
    main="Landmark",
    point.size=0.2, 
    point.shape=19, 
    point.color=x.sp@metaData[,"landmark"], 
    text.add=FALSE,
    text.size=1.5,
    text.color="black",
    down.sample=10000,
    legend.add=TRUE
  );

image.png

Step 9. scRNA-seq based annotation

在本示例中萧诫，我們將使用相應(yīng)的scRNA-seq數(shù)據(jù)集來(lái)對(duì)scATAC-seq數(shù)據(jù)的細(xì)胞類群進(jìn)行注釋。我們可以通過(guò)(https://www.dropbox.com/s/3f3p5nxrn5b3y4y/pbmc_10k_v3.rds?dl=1)地址下載所需的10X PBMC scRNA-seq(pbmc_10k_v3.rds)的Seurat對(duì)象枝嘶。

> library(Seurat);
> pbmc.rna = readRDS("pbmc_10k_v3.rds");
> pbmc.rna$tech = "rna";
> variable.genes = VariableFeatures(object = pbmc.rna);

> genes.df = read.table("gencode.v19.annotation.gene.bed");
> genes.gr = GRanges(genes.df[,1], IRanges(genes.df[,2], genes.df[,3]), name=genes.df[,4]);
> genes.sel.gr = genes.gr[which(genes.gr$name %in% variable.genes)];

## reload the bmat, this is optional but highly recommanded
> x.sp = addBmatToSnap(x.sp);
> x.sp = createGmatFromMat(
    obj=x.sp, 
    input.mat="bmat",
    genes=genes.sel.gr,
    do.par=TRUE,
    num.cores=10
  );

接下來(lái)帘饶，我們將snap對(duì)象轉(zhuǎn)換為Seurat對(duì)象，用于后續(xù)的數(shù)據(jù)整合群扶。

# 使用snapToSeurat函數(shù)將snap對(duì)象轉(zhuǎn)換為Seurat對(duì)象
> pbmc.atac <- snapToSeurat(
    obj=x.sp, 
    eigs.dims=1:20, 
    norm=TRUE,
    scale=TRUE
  );

# 識(shí)別整合的Anchors信息
> transfer.anchors <- FindTransferAnchors(
    reference = pbmc.rna, 
    query = pbmc.atac, 
    features = variable.genes, 
    reference.assay = "RNA", 
    query.assay = "ACTIVITY", 
    reduction = "cca"
  );

# 使用TransferData函數(shù)進(jìn)行數(shù)據(jù)映射整合
> celltype.predictions <- TransferData(
    anchorset = transfer.anchors, 
    refdata = pbmc.rna$celltype,
    weight.reduction = pbmc.atac[["SnapATAC"]],
    dims = 1:20
  );
> x.sp@metaData$predicted.id = celltype.predictions$predicted.id;
> x.sp@metaData$predict.max.score = apply(celltype.predictions[,-1], 1, max);
> x.sp@cluster = as.factor(x.sp@metaData$predicted.id);

Step 10. Create psudo multiomics cells

現(xiàn)在及刻，snap對(duì)象x.sp中包含了的每個(gè)細(xì)胞的染色質(zhì)可及性@bmat和基因表達(dá)@gmat的信息。

> refdata <- GetAssayData(
    object = pbmc.rna, 
    assay = "RNA", 
    slot = "data"
  );
> imputation <- TransferData(
    anchorset = transfer.anchors, 
    refdata = refdata, 
    weight.reduction = pbmc.atac[["SnapATAC"]], 
    dims = 1:20
  );
> x.sp@gmat = t(imputation@data);
> rm(imputation); # free memory
> rm(refdata);    # free memory
> rm(pbmc.rna);   # free memory
> rm(pbmc.atac); # free memory

Step 11. Remove cells of low prediction score

> hist(
    x.sp@metaData$predict.max.score, 
    xlab="prediction score", 
    col="lightblue", 
    xlim=c(0, 1),
    main="PBMC 10X"
  );
> abline(v=0.5, col="red", lwd=2, lty=2);

> table(x.sp@metaData$predict.max.score > 0.5);
## FALSE  TRUE
##  331 13103

> x.sp = x.sp[x.sp@metaData$predict.max.score > 0.5,];
> x.sp
## number of barcodes: 13045
## number of bins: 627478
## number of genes: 19089
## number of peaks: 0

> plotViz(
    obj=x.sp,
    method="umap", 
    main="PBMC 10X",
    point.color=x.sp@metaData[,"predicted.id"], 
    point.size=0.5, 
    point.shape=19, 
    text.add=TRUE,
    text.size=1,
    text.color="black",
    down.sample=10000,
    legend.add=FALSE
  );

image.png

Step 12. Gene expression projected onto UMAP

接下來(lái)竞阐，我們使用UMAP方法對(duì)一些marker基因的表達(dá)水平進(jìn)行可視化展示缴饭。

> marker.genes = c(
    "IL32", "LTB", "CD3D",
    "IL7R", "LDHB", "FCGR3A", 
    "CD68", "MS4A1", "GNLY", 
    "CD3E", "CD14", "CD14", 
    "FCGR3A", "LYZ", "PPBP", 
    "CD8A", "PPBP", "CST3", 
    "NKG7", "MS4A7", "MS4A1", 
    "CD8A"
  );

> par(mfrow = c(3, 3));
> for(i in 1:9){
    j = which(colnames(x.sp@gmat) == marker.genes[i])
    plotFeatureSingle(
        obj=x.sp,
        feature.value=x.sp@gmat[,j],
        method="umap", 
        main=marker.genes[i],
        point.size=0.1, 
        point.shape=19, 
        down.sample=10000,
        quantiles=c(0.01, 0.99)
 )};

image.png

Step 13. Identify peaks

接下來(lái)，我們將每個(gè)cluster群中的reads進(jìn)行聚合馁菜，創(chuàng)建用于peak calling和可視化的集成track茴扁。執(zhí)行完peak calling后，將生成一個(gè).narrowPeak的文件汪疮，該文件中包含了識(shí)別出的peaks信息峭火，和一個(gè).bedgraph的文件，可以用于可視化展示智嚷。為了獲得最robust的結(jié)果卖丸，我們不建議對(duì)細(xì)胞數(shù)目小于100的cluster群執(zhí)行此步驟。

> system("which snaptools");
/home/r3fang/anaconda2/bin/snaptools
> system("which macs2");
/home/r3fang/anaconda2/bin/macs2

# 使用runMACS函數(shù)調(diào)用macs2進(jìn)行peak calling
> peaks = runMACS(
    obj=x.sp[which(x.sp@cluster=="CD4 Naive"),], 
    output.prefix="PBMC.CD4_Naive",
    path.to.snaptools="/home/r3fang/anaconda2/bin/snaptools",
    path.to.macs="/home/r3fang/anaconda2/bin/macs2",
    gsize="hs", # mm, hs, etc
    buffer.size=500, 
    num.cores=10,
    macs.options="--nomodel --shift 100 --ext 200 --qval 5e-2 -B --SPMR",
    tmp.folder=tempdir()
    );

為了對(duì)所有的cluster群執(zhí)行peak calling盏道，我們提供了一個(gè)簡(jiǎn)便的腳本來(lái)實(shí)現(xiàn)該步驟稍浆。

# call peaks for all cluster with more than 100 cells
# 選出那些細(xì)胞數(shù)目大于100的cluster群
> clusters.sel = names(table(x.sp@cluster))[which(table(x.sp@cluster) > 100)];

# 批量對(duì)不同的cluster進(jìn)行peak calling
> peaks.ls = mclapply(seq(clusters.sel), function(i){
    print(clusters.sel[i]);
    peaks = runMACS(
        obj=x.sp[which(x.sp@cluster==clusters.sel[i]),], 
        output.prefix=paste0("PBMC.", gsub(" ", "_", clusters.sel)[i]),
        path.to.snaptools="/home/r3fang/anaconda2/bin/snaptools",
        path.to.macs="/home/r3fang/anaconda2/bin/macs2",
        gsize="hs", # mm, hs, etc
        buffer.size=500, 
        num.cores=1,
        macs.options="--nomodel --shift 100 --ext 200 --qval 5e-2 -B --SPMR",
        tmp.folder=tempdir()
   );
    peaks
    }, mc.cores=5);

> peaks.names = system("ls | grep narrowPeak", intern=TRUE);

> peak.gr.ls = lapply(peaks.names, function(x){
    peak.df = read.table(x)
    GRanges(peak.df[,1], IRanges(peak.df[,2], peak.df[,3]))
  })

# 合并peak信息
> peak.gr = reduce(Reduce(c, peak.gr.ls));

我們可以將每個(gè)cluster群生成的bdg文件導(dǎo)入到IGV或其他基因組瀏覽器(如UW genome browser)進(jìn)行可視化的展示，下面是來(lái)自UW genome browser的FOXJ2基因及其側(cè)翼區(qū)域的screenshot猜嘱。

image.png

Step 14. Create a cell-by-peak matrix

接下來(lái)衅枫，我們使用合并后的peaks作為參考，來(lái)創(chuàng)建一個(gè)cell-by-peak計(jì)數(shù)矩陣朗伶，并將其添加到snap對(duì)象中弦撩。
首先，我們將合并后的peaks信息寫(xiě)入到peaks.combined.bed文件中

> peaks.df = as.data.frame(peak.gr)[,1:3];
> write.table(peaks.df,file = "peaks.combined.bed",append=FALSE,
        quote= FALSE,sep="\t", eol = "\n", na = "NA", dec = ".", 
        row.names = FALSE, col.names = FALSE, qmethod = c("escape", "double"),
        fileEncoding = "")

接下來(lái)论皆，我們創(chuàng)建一個(gè)cell-by-peak計(jì)數(shù)矩陣益楼，并將其添加到snap對(duì)象中。這一步可能需要一段時(shí)間点晴。

$ snaptools snap-add-pmat \
    --snap-file atac_pbmc_10k_nextgem.snap \
    --peak-file peaks.combined.bed &
$ snaptools snap-add-pmat \
    --snap-file atac_pbmc_5k_nextgem.snap \
    --peak-file peaks.combined.bed

然后感凤，我們將計(jì)算好的cell-by-peak計(jì)數(shù)矩陣添加到已有的snap對(duì)象中。

> x.sp = addPmatToSnap(x.sp);

Step 15. Identify differentially accessible peaks

對(duì)于一組給定的細(xì)胞Ci粒督，我們首先在diffusion component空間中尋找鄰近的細(xì)胞Cj (|Ci|=|Cj|)作為“background”細(xì)胞進(jìn)行比較陪竿。如果Ci細(xì)胞的數(shù)目占總細(xì)胞數(shù)目的一半以上，則使用剩余的細(xì)胞作為local background屠橄。接下來(lái)萨惑，我們將Ci和Cj細(xì)胞進(jìn)行聚合捐康，來(lái)創(chuàng)建兩個(gè)raw-count向量，即Vci和Vcj庸蔼。然后，我們使用R包edgeR (v3.18.1)的精確檢驗(yàn)（exact test）來(lái)對(duì)Vci和Vcj進(jìn)行差異分析贮匕，其中姐仅，對(duì)于小鼠設(shè)置BCV=0.1，而人類設(shè)置BCV= 0.4刻盐。最后掏膏，再使用Benjamini-Hochberg多重檢驗(yàn)校正方法，來(lái)調(diào)整p-value值為錯(cuò)誤發(fā)現(xiàn)率(FDR)值敦锌。對(duì)于FDR值小于0.05的peaks馒疹，被選為顯著的DARs。

我們發(fā)現(xiàn)這種方法也存在一定的局限性乙墙，特別是對(duì)于那些細(xì)胞數(shù)目較少的cluster群颖变，統(tǒng)計(jì)的有效性可能不夠強(qiáng)大。對(duì)于那些未能識(shí)別出顯著差異peaks的cluster群听想，我們將根據(jù)富集p-value值對(duì)元素進(jìn)行排序腥刹，并挑選出最具代表性的2,000個(gè)peaks進(jìn)行后續(xù)的motif分析。

> DARs = findDAR(
    obj=x.sp,
    input.mat="pmat",
    cluster.pos="CD14+ Monocytes",
    cluster.neg.method="knn",
    test.method="exactTest",
    bcv=0.4, #0.4 for human, 0.1 for mouse
    seed.use=10
  );
> DARs$FDR = p.adjust(DARs$PValue, method="BH");
> idy = which(DARs$FDR < 5e-2 & DARs$logFC > 0);

> par(mfrow = c(1, 2));
> plot(DARs$logCPM, DARs$logFC, 
    pch=19, cex=0.1, col="grey", 
    ylab="logFC", xlab="logCPM",
    main="CD14+ Monocytes"
  );
> points(DARs$logCPM[idy], 
    DARs$logFC[idy], 
    pch=19, 
    cex=0.5, 
    col="red"
  );
> abline(h = 0, lwd=1, lty=2);

> covs = Matrix::rowSums(x.sp@pmat);
> vals = Matrix::rowSums(x.sp@pmat[,idy]) / covs;
> vals.zscore = (vals - mean(vals)) / sd(vals);

> plotFeatureSingle(
    obj=x.sp,
    feature.value=vals.zscore,
    method="umap", 
    main="CD14+ Monocytes",
    point.size=0.1, 
    point.shape=19, 
    down.sample=5000,
    quantiles=c(0.01, 0.99)
  );

Step 16. Motif variability analysis

SnapATAC可以調(diào)用chromVAR(Schep et al)程序進(jìn)行motif可變性分析汉买。

# 加載所需的R包
> library(chromVAR);
> library(motifmatchr);
> library(SummarizedExperiment);
> library(BSgenome.Hsapiens.UCSC.hg19);

> x.sp = makeBinary(x.sp, "pmat");

# 使用runChromVAR函數(shù)調(diào)用ChromVAR進(jìn)行motif可變性分析
> x.sp@mmat = runChromVAR(
    obj=x.sp,
    input.mat="pmat",
    genome=BSgenome.Hsapiens.UCSC.hg19,
    min.count=10,
    species="Homo sapiens"
  );
> x.sp;
## number of barcodes: 13103
## number of bins: 627478
## number of genes: 19089
## number of peaks: 157750
## number of motifs: 271

> motif_i = "MA0071.1_RORA";
> dat = data.frame(x=x.sp@metaData$predicted.id, y=x.sp@mmat[,motif_i]);
> p <- ggplot(dat, aes(x=x, y=y, fill=x)) + 
    theme_classic() +
    geom_violin() + 
    xlab("cluster") +
    ylab("motif enrichment") + 
    ggtitle("MA0071.1_RORA") +
    theme(
          plot.margin = margin(5,1,5,1, "cm"),
          axis.text.x = element_text(angle = 90, hjust = 1),
          axis.ticks.x=element_blank(),
          legend.position = "none"
   );

image.png

Step 17. De novo motif discovery

SnapATAC還可以調(diào)用homer用于對(duì)識(shí)別出的差異可及性區(qū)域(DARs)進(jìn)行motif的識(shí)別和富集分析.

# 查看findMotifsGenome.pl程序安裝的路徑
> system("which findMotifsGenome.pl");
/projects/ps-renlab/r3fang/public_html/softwares/homer/bin/findMotifsGenome.pl

# 使用findDAR函數(shù)鑒定DAR區(qū)域
> DARs = findDAR(
    obj=x.sp,
    input.mat="pmat",
    cluster.pos="Double negative T cell",
    cluster.neg.method="knn",
    test.method="exactTest",
    bcv=0.4, #0.4 for human, 0.1 for mouse
    seed.use=10
  );
> DARs$FDR = p.adjust(DARs$PValue, method="BH");
> idy = which(DARs$FDR < 5e-2 & DARs$logFC > 0);

# 使用runHomer函數(shù)調(diào)用homer進(jìn)行de novo motif discovery
> motifs = runHomer(
    x.sp[,idy,"pmat"], 
    mat = "pmat",
    path.to.homer = "/projects/ps-renlab/r3fang/public_html/softwares/homer/bin/findMotifsGenome.pl",
    result.dir = "./homer/DoubleNegativeTcell",
    num.cores=5,
    genome = 'hg19',
    motif.length = 10,
    scan.size = 300,
    optimize.count = 2,
    background = 'automatic',
    local.background = FALSE,
    only.known = TRUE,
    only.denovo = FALSE,
    fdr.num = 5,
    cache = 100,
    overwrite = TRUE,
    keep.minimal = FALSE
    );

image.png

Step 18. Predict gene-enhancer pairs

最后衔峰，我們還可以使用一種“pseudo”細(xì)胞的方法，根據(jù)單個(gè)細(xì)胞中基因的表達(dá)和遠(yuǎn)端調(diào)控元件的染色質(zhì)可及性的關(guān)系蛙粘，將遠(yuǎn)端調(diào)控元件連接到目標(biāo)基因上垫卤。對(duì)于給定的marker基因，我們首先識(shí)別出marker基因側(cè)翼區(qū)域內(nèi)的peak出牧。對(duì)于每個(gè)側(cè)翼的peak穴肘，我們使用基因表達(dá)作為輸入變量進(jìn)行邏輯回歸來(lái)預(yù)測(cè)binarized的染色質(zhì)狀態(tài)。產(chǎn)生的結(jié)果估計(jì)了染色質(zhì)可及性與基因表達(dá)之間聯(lián)系的重要性崔列。

> TSS.loci = GRanges("chr12", IRanges(8219067, 8219068));

> pairs = predictGenePeakPair(
    x.sp, 
    input.mat="pmat",
    gene.name="C3AR1", 
    gene.loci=resize(TSS.loci, width=500000, fix="center"),
    do.par=FALSE
    );

# convert the pair to genome browser arc plot format
> pairs.df = as.data.frame(pairs);
> pairs.df = data.frame(
    chr1=pairs.df[,"seqnames"],
    start1=pairs.df[,"start"],
    end1=pairs.df[,"end"],
    chr2="chr2",
    start2=8219067,
    end2=8219068,
    Pval=pairs.df[,"logPval"]
    );
> head(pairs.df)
##    chr1  start1    end1 chr2  start2    end2       Pval
## 1 chr12 7984734 7985229 chr2 8219067 8219068 14.6075918
## 2 chr12 7987561 7988085 chr2 8219067 8219068  5.6718381
## 3 chr12 7989776 7990567 chr2 8219067 8219068 24.2564608
## 4 chr12 7996454 7996667 chr2 8219067 8219068  0.6411017
## 5 chr12 8000059 8000667 chr2 8219067 8219068  2.0324922
## 6 chr12 8012404 8013040 chr2 8219067 8219068  0.0000000

image.png

參考來(lái)源：https://gitee.com/booew/SnapATAC/blob/master/examples/10X_PBMC_15K/README.md

最后編輯于：2020.09.17 12:26:13

?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請(qǐng)聯(lián)系作者

人面猴
序言：七十年代末梢褐，一起剝皮案震驚了整個(gè)濱河市，隨后出現(xiàn)的幾起案子赵讯，更是在濱河造成了極大的恐慌盈咳，老刑警劉巖，帶你破解...
沈念sama閱讀 206,013評(píng)論 6贊 481
死咒
序言：濱河連續(xù)發(fā)生了三起死亡事件边翼，死亡現(xiàn)場(chǎng)離奇詭異鱼响，居然都是意外死亡，警方通過(guò)查閱死者的電腦和手機(jī)组底，發(fā)現(xiàn)死者居然都...
沈念sama閱讀 88,205評(píng)論 2贊 382
救了他兩次的神仙讓他今天三更去死
文/潘曉璐我一進(jìn)店門丈积，熙熙樓的掌柜王于貴愁眉苦臉地迎上來(lái)筐骇，“玉大人，你說(shuō)我怎么就攤上這事江滨☆跷常” “怎么了？”我有些...
開(kāi)封第一講書(shū)人閱讀 152,370評(píng)論 0贊 342
道士緝兇錄：失蹤的賣姜人
文/不壞的土叔我叫張陵唬滑，是天一觀的道長(zhǎng)告唆。經(jīng)常有香客問(wèn)我，道長(zhǎng)晶密，這世上最難降的妖魔是什么擒悬？我笑而不...
開(kāi)封第一講書(shū)人閱讀 55,168評(píng)論 1贊 278
?港島之戀（遺憾婚禮）
正文為了忘掉前任，我火速辦了婚禮稻艰，結(jié)果婚禮上懂牧，老公的妹妹穿的比我還像新娘。我一直安慰自己尊勿，他們只是感情好僧凤，可當(dāng)我...
茶點(diǎn)故事閱讀 64,153評(píng)論 5贊 371
惡毒庶女頂嫁案：這布局不是一般人想出來(lái)的
文/花漫我一把揭開(kāi)白布。她就那樣靜靜地躺著运怖，像睡著了一般拼弃。火紅的嫁衣襯著肌膚如雪。梳的紋絲不亂的頭發(fā)上摇展，一...
開(kāi)封第一講書(shū)人閱讀 48,954評(píng)論 1贊 283
城市分裂傳說(shuō)
那天吻氧，我揣著相機(jī)與錄音，去河邊找鬼咏连。笑死盯孙，一個(gè)胖子當(dāng)著我的面吹牛，可吹牛的內(nèi)容都是我干的祟滴。我是一名探鬼主播振惰，決...
沈念sama閱讀 38,271評(píng)論 3贊 399
雙鴛鴦連環(huán)套：你想象不到人心有多黑
文/蒼蘭香墨我猛地睜開(kāi)眼，長(zhǎng)吁一口氣：“原來(lái)是場(chǎng)噩夢(mèng)啊……” “哼垄懂！你這毒婦竟也來(lái)了骑晶？” 一聲冷哼從身側(cè)響起，我...
開(kāi)封第一講書(shū)人閱讀 36,916評(píng)論 0贊 259
萬(wàn)榮殺人案實(shí)錄
序言：老撾萬(wàn)榮一對(duì)情侶失蹤草慧，失蹤者是張志新（化名）和其女友劉穎桶蛔，沒(méi)想到半個(gè)月后，有當(dāng)?shù)厝嗽跇?shù)林里發(fā)現(xiàn)了一具尸體漫谷，經(jīng)...
沈念sama閱讀 43,382評(píng)論 1贊 300
?護(hù)林員之死
正文獨(dú)居荒郊野嶺守林人離奇死亡仔雷，尸身上長(zhǎng)有42處帶血的膿包…… 初始之章·張勛以下內(nèi)容為張勛視角年9月15日...
茶點(diǎn)故事閱讀 35,877評(píng)論 2贊 323
?白月光啟示錄
正文我和宋清朗相戀三年，在試婚紗的時(shí)候發(fā)現(xiàn)自己被綠了。大學(xué)時(shí)的朋友給我發(fā)了我未婚夫和他白月光在一起吃飯的照片碟婆。...
茶點(diǎn)故事閱讀 37,989評(píng)論 1贊 333
活死人
序言：一個(gè)原本活蹦亂跳的男人離奇死亡电抚，死狀恐怖，靈堂內(nèi)的尸體忽然破棺而出竖共，到底是詐尸還是另有隱情蝙叛，我是刑警寧澤，帶...
沈念sama閱讀 33,624評(píng)論 4贊 322
?日本核電站爆炸內(nèi)幕
正文年R本政府宣布公给，位于F島的核電站甥温，受9級(jí)特大地震影響，放射性物質(zhì)發(fā)生泄漏妓布。R本人自食惡果不足惜，卻給世界環(huán)境...
茶點(diǎn)故事閱讀 39,209評(píng)論 3贊 307
男人毒藥：我在死后第九天來(lái)索命
文/蒙蒙一宋梧、第九天我趴在偏房一處隱蔽的房頂上張望匣沼。院中可真熱鬧，春花似錦捂龄、人聲如沸释涛。這莊子的主人今日做“春日...
開(kāi)封第一講書(shū)人閱讀 30,199評(píng)論 0贊 19
一樁弒父案倦沧，背后竟有這般陰謀
文/蒼蘭香墨我抬頭看了看天上的太陽(yáng)唇撬。三九已至，卻和暖如春展融，著一層夾襖步出監(jiān)牢的瞬間窖认，已是汗流浹背。一陣腳步聲響...
開(kāi)封第一講書(shū)人閱讀 31,418評(píng)論 1贊 260
情欲美人皮
我被黑心中介騙來(lái)泰國(guó)打工告希，沒(méi)想到剛下飛機(jī)就差點(diǎn)兒被人妖公主榨干…… 1. 我叫王不留扑浸，地道東北人。一個(gè)月前我還...
沈念sama閱讀 45,401評(píng)論 2贊 352
代替公主和親
正文我出身青樓燕偶，卻偏偏與公主長(zhǎng)得像喝噪，于是被迫代替她去往敵國(guó)和親。傳聞我的和親對(duì)象是個(gè)殘疾皇子指么，可洞房花燭夜當(dāng)晚...
茶點(diǎn)故事閱讀 42,700評(píng)論 2贊 345