使用SnapATAC分析單細(xì)胞ATAC-seq數(shù)據(jù)(三):Integrative Analysis of PBMC scATAC-seq and scRNA-seq

image.png

簡(jiǎn)介

在本教程中,我們將對(duì)PBMC的兩個(gè)scATAC-seq數(shù)據(jù)集(5K和10K)和一個(gè)scRNA-seq數(shù)據(jù)集進(jìn)行整合分析吮廉。這三個(gè)數(shù)據(jù)集均來(lái)自10X genomics測(cè)序平臺(tái)產(chǎn)生的數(shù)據(jù),可以直接在10x官網(wǎng)下載使用仔燕。
具體來(lái)說(shuō),我們將主要執(zhí)行以下分析內(nèi)容:

  • Cell selection for PBMC 5k and 10k scATAC;
  • Randomly sample 10,000 cells as landmarks;
  • Unsupervised clustering of landmarks;
  • Project the remaining (query) cells onto the landmarks;
  • Supervised annotation of scATAC clusters using scRNA dataset;
  • Downstream analysis including peak calling, differential analysis, prediction of gene-enhancer pairing.

分析步驟

  • Step 0. Data download
  • Step 1. Barcode selection
  • Step 2. Add cell-by-bin matrix
  • Step 3. Matrix binarization
  • Step 4. Bin filtering
  • Step 5. Dimensionality reduction of landmarks
  • Step 6. Determine significant components
  • Step 7. Graph-based clustering
  • Step 8. Visualization
  • Step 9. scRNA-seq based annotation
  • Step 10. Create psudo multiomics cells
  • Step 11. Remove cells of low prediction score
  • Step 12. Gene expression projected into UMAP
  • Step 13. Identify peak
  • Step 14. Create a cell-by-peak matrix
  • Step 15. Identify differentially accessible regions
  • Step 16. Motif variability analysis
  • Step 17. De novo motif discovery
  • Step 18. Predict gene-enhancer pairs

Step 0. Data Download

開(kāi)始分析之前竿裂,我們需要下載snaptools生成的snap文件和cellranger-atac產(chǎn)生的singlecell.csv文件矩父。

# 下載所需的數(shù)據(jù)集和基因注釋信息
$ wget http://renlab.sdsc.edu/r3fang//share/github/PBMC_ATAC_RNA/atac_pbmc_5k_nextgem.snap
$ wget http://cf.10xgenomics.com/samples/cell-atac/1.1.0/atac_pbmc_5k_nextgem/atac_pbmc_5k_nextgem_singlecell.csv
$ wget http://renlab.sdsc.edu/r3fang//share/github/PBMC_ATAC_RNA/atac_pbmc_10k_nextgem.snap
$ wget http://cf.10xgenomics.com/samples/cell-atac/1.1.0/atac_pbmc_10k_nextgem/atac_pbmc_10k_nextgem_singlecell.csv
$ wget http://cf.10xgenomics.com/samples/cell-atac/1.1.0/atac_pbmc_10k_nextgem/hg19.blacklist.bed.gz
$ wget http://cf.10xgenomics.com/samples/cell-atac/1.1.0/gencode.v19.annotation.gene.bed

Step 1. Barcode selection

首先没龙,我們根據(jù)以下兩個(gè)主要的標(biāo)準(zhǔn)來(lái)選擇高質(zhì)量的barcodes:

  1. number of unique fragments;
  2. fragments in promoter ratio;
# 加載所需R包
> library(SnapATAC);
> snap.files = c(
    "atac_pbmc_5k_nextgem.snap", 
    "atac_pbmc_10k_nextgem.snap"
  );
> sample.names = c(
    "PBMC 5K",
    "PBMC 10K"
  );
> barcode.files = c(
    "atac_pbmc_5k_nextgem_singlecell.csv",
    "atac_pbmc_10k_nextgem_singlecell.csv"
  );

# 讀取snap文件
> x.sp.ls = lapply(seq(snap.files), function(i){
    createSnap(
        file=snap.files[i],
        sample=sample.names[i]
    );
  })
> names(x.sp.ls) = sample.names;

# 讀取barcode信息
> barcode.ls = lapply(seq(snap.files), function(i){
    barcodes = read.csv(
        barcode.files[i], 
        head=TRUE
    );
    # remove NO BAROCDE line
    barcodes = barcodes[2:nrow(barcodes),];
    barcodes$logUMI = log10(barcodes$passed_filters + 1);
    barcodes$promoter_ratio = (barcodes$promoter_region_fragments+1) / (barcodes$passed_filters + 1);
    barcodes
  })

# 質(zhì)控指標(biāo)數(shù)據(jù)可視化
> plots = lapply(seq(snap.files), function(i){
    p1 = ggplot(
        barcode.ls[[i]], 
        aes(x=logUMI, y=promoter_ratio)) + 
        geom_point(size=0.3, col="grey") +
        theme_classic() +
        ggtitle(sample.names[[i]]) +
        ylim(0, 1) + xlim(0, 6) + 
        labs(x = "log10(UMI)", y="promoter ratio")
        p1
    })
> plots
image.png
# 查看snap對(duì)象信息
> x.sp.ls
## $`PBMC 5K`
## number of barcodes: 20000
## number of bins: 0
## number of genes: 0
## number of peaks: 0
## number of motifs: 0
## 
## $`PBMC 10K`
## number of barcodes: 20000
## number of bins: 0
## number of genes: 0
## number of peaks: 0
## number of motifs: 0

# for both datasets, we identify usable barcodes using [3.5-5] for log10(UMI) and [0.4-0.8] for promoter ratio as cutoff.
# 設(shè)定質(zhì)控閾值進(jìn)行篩選過(guò)濾
> cutoff.logUMI.low = c(3.5, 3.5);
> cutoff.logUMI.high = c(5, 5);
> cutoff.FRIP.low = c(0.4, 0.4);
> cutoff.FRIP.high = c(0.8, 0.8);

# 根據(jù)過(guò)濾的閾值進(jìn)行barcode的篩選
> barcode.ls = lapply(seq(snap.files), function(i){
    barcodes = barcode.ls[[i]];
    idx = which(
        barcodes$logUMI >= cutoff.logUMI.low[i] & 
        barcodes$logUMI <= cutoff.logUMI.high[i] & 
        barcodes$promoter_ratio >= cutoff.FRIP.low[i] &
        barcodes$promoter_ratio <= cutoff.FRIP.high[i]
    );
    barcodes[idx,]
  });

> x.sp.ls = lapply(seq(snap.files), function(i){
    barcodes = barcode.ls[[i]];
    x.sp = x.sp.ls[[i]];
    barcode.shared = intersect(x.sp@barcode, barcodes$barcode);
    x.sp = x.sp[match(barcode.shared, x.sp@barcode),];
    barcodes = barcodes[match(barcode.shared, barcodes$barcode),];
    x.sp@metaData = barcodes;
    x.sp
  })

> names(x.sp.ls) = sample.names;
> x.sp.ls
## $`PBMC 5K`
## number of barcodes: 4526
## number of bins: 0
## number of genes: 0
## number of peaks: 0
## number of motifs: 0
## 
## $`PBMC 10K`
## number of barcodes: 9039
## number of bins: 0
## number of genes: 0
## number of peaks: 0
## number of motifs: 0
# combine two snap object

# combine two snap object
# 使用Reduce函數(shù)將兩個(gè)snap對(duì)象進(jìn)行合并
> x.sp = Reduce(snapRbind, x.sp.ls);
> x.sp@metaData["sample"] = x.sp@sample;

> x.sp
## number of barcodes: 13565
## number of bins: 0
## number of genes: 0
## number of peaks: 0

> table(x.sp@sample);
## PBMC 10K  PBMC 5K
##     9039     4526

Step 2. Add cell-by-bin matrix

接下來(lái)奢赂,我們使用addBmatToSnap函數(shù)生成5kb分辨率的cell-by-bin計(jì)數(shù)矩陣陪白,并將其添加到snap對(duì)象中。該函數(shù)將自動(dòng)從兩個(gè)snap文件中讀取cell-by-bin矩陣膳灶,并將連接后的矩陣添加到snap對(duì)象的bmat屬性中咱士。

# 使用addBmatToSnap函數(shù)計(jì)算cell-by-bin計(jì)數(shù)矩陣
> x.sp = addBmatToSnap(x.sp, bin.size=5000);

Step 3. Matrix binarization

我們使用makeBinary函數(shù)將cell-by-bin計(jì)數(shù)矩陣轉(zhuǎn)換為二進(jìn)制矩陣。在count矩陣中轧钓,某些項(xiàng)具有異常高的覆蓋率司致,這可能是由比對(duì)錯(cuò)誤造成的。因此聋迎,我們將刪除計(jì)數(shù)矩陣中top 0.1%的項(xiàng),并將其余非零的值轉(zhuǎn)換為1枣耀。

# 使用makeBinary函數(shù)將cell-by-bin計(jì)數(shù)矩陣轉(zhuǎn)換成二進(jìn)制矩陣
> x.sp = makeBinary(x.sp, mat="bmat");

Step 4. Bin filtering

首先霉晕,我們過(guò)濾掉任何與ENCODE中blacklist區(qū)域重疊的bins庭再,以防止?jié)撛诘腶rtifacts。

> library(GenomicRanges);
# 讀取blacklist信息
> black_list = read.table("hg19.blacklist.bed.gz");
> black_list.gr = GRanges(
    black_list[,1], 
    IRanges(black_list[,2], black_list[,3])
  );

> idy = queryHits(
    findOverlaps(x.sp@feature, black_list.gr)
  );
> if(length(idy) > 0){
    x.sp = x.sp[,-idy, mat="bmat"];
  };
> x.sp
## number of barcodes: 13565
## number of bins: 624794
## number of genes: 0
## number of peaks: 0
## number of motifs: 0

接下來(lái)牺堰,我們移除那些不需要的染色體信息拄轻。

> chr.exclude = seqlevels(x.sp@feature)[grep("random|chrM", seqlevels(x.sp@feature))];

> idy = grep(paste(chr.exclude, collapse="|"), x.sp@feature);
> if(length(idy) > 0){
    x.sp = x.sp[,-idy, mat="bmat"]
  };
> x.sp
## number of barcodes: 13565
## number of bins: 624297
## number of genes: 0
## number of peaks: 0
## number of motifs: 0

第三,bins的覆蓋率大致是服從對(duì)數(shù)正態(tài)分布的伟葫。我們將與不變特征(如管家基因的啟動(dòng)子)重疊的前5%的bins進(jìn)行刪除 恨搓。

> bin.cov = log10(Matrix::colSums(x.sp@bmat)+1);
> hist(
    bin.cov[bin.cov > 0], 
    xlab="log10(bin cov)", 
    main="log10(Bin Cov)", 
    col="lightblue", 
    xlim=c(0, 5)
  );
> bin.cutoff = quantile(bin.cov[bin.cov > 0], 0.95);
> idy = which(bin.cov <= bin.cutoff & bin.cov > 0);
> x.sp = x.sp[, idy, mat="bmat"];
> x.sp
## number of barcodes: 13565
## number of bins: 534985
## number of genes: 0
## number of peaks: 0
## number of motifs: 0
image.png

最后,我們將進(jìn)一步刪除bin覆蓋率小于1000的任何細(xì)胞筏养,這是因?yàn)楸M管有些細(xì)胞可能含有較高的unique fragments片段斧抱,但是過(guò)濾后的bin覆蓋率卻很低。此步驟是可選的渐溶,但強(qiáng)烈建議這樣做辉浦。

> idx = which(Matrix::rowSums(x.sp@bmat) > 1000);
> x.sp = x.sp[idx,];
> x.sp
## number of barcodes: 13434
## number of bins: 534985
## number of genes: 0
## number of peaks: 0
## number of motifs: 0

Step 5. Dimentionality reduction

SnapATAC采用diffusion maps算法進(jìn)行數(shù)據(jù)降維,這是一種非線性降維的技術(shù)茎辐,它通過(guò)對(duì)數(shù)據(jù)執(zhí)行random walk來(lái)發(fā)現(xiàn)低維的manifold宪郊,并且對(duì)噪音和擾動(dòng)具有很強(qiáng)的魯棒性。

但是拖陆,diffusion maps算法的計(jì)算成本會(huì)隨著細(xì)胞數(shù)目的增加而呈現(xiàn)指數(shù)級(jí)增長(zhǎng)的趨勢(shì)弛槐。為了克服這一局限,我們將Nystr?m方法(a sampling technique)和diffusion maps算法相結(jié)合依啰,給出Nystr?m Landmark diffusion map來(lái)生成大規(guī)模數(shù)據(jù)集的低維嵌入乎串。

Nystr?m landmark diffusion maps算法主要包括以下三個(gè)步驟:

  • sampling: sample a subset of K (K?N) cells from N total cells as “l(fā)andmarks”. Instead of random sampling, here we adopted a density-based sampling approach developed in SCTransform to preserve the density distribution of the N original points;
  • embedding: compute a diffusion map embedding for K landmarks;
  • extension: project the remaining N-K cells onto the low-dimensional embedding as learned from the landmarks to create a joint embedding space for all cells.

在本示例中,我們將采樣10,000個(gè)細(xì)胞作為landmarks孔飒,并將余下的query細(xì)胞投射到嵌入landmarks的diffusion maps上灌闺。

> row.covs.dens <- density(
    x = x.sp@metaData[,"logUMI"], 
    bw = 'nrd', adjust = 1
  );
> sampling_prob <- 1 / (approx(x = row.covs.dens$x, y = row.covs.dens$y, xout = x.sp@metaData[,"logUMI"])$y + .Machine$double.eps);

> set.seed(1);
> idx.landmark.ds <- sort(sample(x = seq(nrow(x.sp)), size = 10000, prob = sampling_prob));

將x.sp對(duì)象分割為landmark(x.landmark.sp)細(xì)胞和query(x.query.sp)細(xì)胞。

> x.landmark.sp = x.sp[idx.landmark.ds,];
> x.query.sp = x.sp[-idx.landmark.ds,];

Run diffusion maps on the landmark cells.
使用landmark細(xì)胞進(jìn)行diffusion maps降維處理

> x.landmark.sp = runDiffusionMaps(
    obj= x.landmark.sp,
    input.mat="bmat", 
    num.eigs=50
  );
> x.landmark.sp@metaData$landmark = 1;

Porject query cells to landmarks.
將query細(xì)胞映射到landmarks上

> x.query.sp = runDiffusionMapsExtension(
    obj1=x.landmark.sp, 
    obj2=x.query.sp,
    input.mat="bmat"
  );
> x.query.sp@metaData$landmark = 0;

Combine landmark and query cells.
合并landmark和query細(xì)胞
Note: To merge snap objects, all the matrix (bmat, gmat, pmat) and metaData must be of the same number of columns between snap objects.

> x.sp = snapRbind(x.landmark.sp, x.query.sp);
> x.sp = x.sp[order(x.sp@metaData[,"sample"])]; #IMPORTANT

Step 6. Determine significant components

接下來(lái)坏瞄,我們將確定特征向量(eigen-vectors)的數(shù)目桂对,以便用于后續(xù)的分析。在下面的例子中鸠匀,我們選擇前15個(gè)特征向量蕉斜。

> plotDimReductPW(
    obj=x.sp, 
    eigs.dims=1:50,
    point.size=0.3,
    point.color="grey",
    point.shape=19,
    point.alpha=0.6,
    down.sample=5000,
    pdf.file.name=NULL, 
    pdf.height=7, 
    pdf.width=7
  );
image.png

Step 7. Graph-based clustering

接下來(lái),我們使用所選的特征向量成分缀棍,來(lái)構(gòu)造K近鄰(KNN)聚類圖宅此。其中,每個(gè)節(jié)點(diǎn)是一個(gè)細(xì)胞爬范,根據(jù)歐氏距離來(lái)確定每個(gè)細(xì)胞的k個(gè)近鄰細(xì)胞父腕。

> x.sp = runKNN(
    obj=x.sp,
    eigs.dims=1:20,
    k=15
  );

> library(leiden);
> x.sp=runCluster(
    obj=x.sp,
    tmp.folder=tempdir(),
    louvain.lib="leiden",
    seed.use=10,
    resolution=0.7
  );

Step 8. Visualization

SnapATAC可以使用tSNE(FI-tsne)和UMAP方法對(duì)降維聚類后的細(xì)胞進(jìn)行可視化的展示和探索。在本示例中青瀑,我們使用UMAP方法進(jìn)行展示璧亮。

> library(umap);
> x.sp = runViz(
    obj=x.sp, 
    tmp.folder=tempdir(),
    dims=2,
    eigs.dims=1:20, 
    method="umap",
    seed.use=10
  );

> par(mfrow = c(2, 2));
> plotViz(
    obj= x.sp,
    method="umap", 
    main="Cluster",
    point.color=x.sp@cluster, 
    point.size=0.2, 
    point.shape=19, 
    text.add=TRUE,
    text.size=1,
    text.color="black",
    down.sample=10000,
    legend.add=FALSE
  );
> plotFeatureSingle(
    obj=x.sp,
    feature.value=x.sp@metaData[,"logUMI"],
    method="umap", 
    main="Read Depth",
    point.size=0.2, 
    point.shape=19, 
    down.sample=10000,
    quantiles=c(0.01, 0.99)
  );
> plotViz(
    obj= x.sp,
    method="umap", 
    main="Sample",
    point.size=0.2, 
    point.shape=19, 
    point.color=x.sp@sample, 
    text.add=FALSE,
    text.size=1.5,
    text.color="black",
    down.sample=10000,
    legend.add=TRUE
    );
> plotViz(
    obj= x.sp,
    method="umap", 
    main="Landmark",
    point.size=0.2, 
    point.shape=19, 
    point.color=x.sp@metaData[,"landmark"], 
    text.add=FALSE,
    text.size=1.5,
    text.color="black",
    down.sample=10000,
    legend.add=TRUE
  );
image.png

Step 9. scRNA-seq based annotation

在本示例中萧诫,我們將使用相應(yīng)的scRNA-seq數(shù)據(jù)集來(lái)對(duì)scATAC-seq數(shù)據(jù)的細(xì)胞類群進(jìn)行注釋。我們可以通過(guò)(https://www.dropbox.com/s/3f3p5nxrn5b3y4y/pbmc_10k_v3.rds?dl=1)地址下載所需的10X PBMC scRNA-seq(pbmc_10k_v3.rds)的Seurat對(duì)象枝嘶。

> library(Seurat);
> pbmc.rna = readRDS("pbmc_10k_v3.rds");
> pbmc.rna$tech = "rna";
> variable.genes = VariableFeatures(object = pbmc.rna);

> genes.df = read.table("gencode.v19.annotation.gene.bed");
> genes.gr = GRanges(genes.df[,1], IRanges(genes.df[,2], genes.df[,3]), name=genes.df[,4]);
> genes.sel.gr = genes.gr[which(genes.gr$name %in% variable.genes)];

## reload the bmat, this is optional but highly recommanded
> x.sp = addBmatToSnap(x.sp);
> x.sp = createGmatFromMat(
    obj=x.sp, 
    input.mat="bmat",
    genes=genes.sel.gr,
    do.par=TRUE,
    num.cores=10
  );

接下來(lái)帘饶,我們將snap對(duì)象轉(zhuǎn)換為Seurat對(duì)象,用于后續(xù)的數(shù)據(jù)整合群扶。

# 使用snapToSeurat函數(shù)將snap對(duì)象轉(zhuǎn)換為Seurat對(duì)象
> pbmc.atac <- snapToSeurat(
    obj=x.sp, 
    eigs.dims=1:20, 
    norm=TRUE,
    scale=TRUE
  );

# 識(shí)別整合的Anchors信息
> transfer.anchors <- FindTransferAnchors(
    reference = pbmc.rna, 
    query = pbmc.atac, 
    features = variable.genes, 
    reference.assay = "RNA", 
    query.assay = "ACTIVITY", 
    reduction = "cca"
  );

# 使用TransferData函數(shù)進(jìn)行數(shù)據(jù)映射整合
> celltype.predictions <- TransferData(
    anchorset = transfer.anchors, 
    refdata = pbmc.rna$celltype,
    weight.reduction = pbmc.atac[["SnapATAC"]],
    dims = 1:20
  );
> x.sp@metaData$predicted.id = celltype.predictions$predicted.id;
> x.sp@metaData$predict.max.score = apply(celltype.predictions[,-1], 1, max);
> x.sp@cluster = as.factor(x.sp@metaData$predicted.id);

Step 10. Create psudo multiomics cells

現(xiàn)在及刻,snap對(duì)象x.sp中包含了的每個(gè)細(xì)胞的染色質(zhì)可及性@bmat和基因表達(dá)@gmat的信息。

> refdata <- GetAssayData(
    object = pbmc.rna, 
    assay = "RNA", 
    slot = "data"
  );
> imputation <- TransferData(
    anchorset = transfer.anchors, 
    refdata = refdata, 
    weight.reduction = pbmc.atac[["SnapATAC"]], 
    dims = 1:20
  );
> x.sp@gmat = t(imputation@data);
> rm(imputation); # free memory
> rm(refdata);    # free memory
> rm(pbmc.rna);   # free memory
> rm(pbmc.atac); # free memory

Step 11. Remove cells of low prediction score

> hist(
    x.sp@metaData$predict.max.score, 
    xlab="prediction score", 
    col="lightblue", 
    xlim=c(0, 1),
    main="PBMC 10X"
  );
> abline(v=0.5, col="red", lwd=2, lty=2);

> table(x.sp@metaData$predict.max.score > 0.5);
## FALSE  TRUE
##  331 13103

> x.sp = x.sp[x.sp@metaData$predict.max.score > 0.5,];
> x.sp
## number of barcodes: 13045
## number of bins: 627478
## number of genes: 19089
## number of peaks: 0

> plotViz(
    obj=x.sp,
    method="umap", 
    main="PBMC 10X",
    point.color=x.sp@metaData[,"predicted.id"], 
    point.size=0.5, 
    point.shape=19, 
    text.add=TRUE,
    text.size=1,
    text.color="black",
    down.sample=10000,
    legend.add=FALSE
  );
image.png

Step 12. Gene expression projected onto UMAP

接下來(lái)竞阐,我們使用UMAP方法對(duì)一些marker基因的表達(dá)水平進(jìn)行可視化展示缴饭。

> marker.genes = c(
    "IL32", "LTB", "CD3D",
    "IL7R", "LDHB", "FCGR3A", 
    "CD68", "MS4A1", "GNLY", 
    "CD3E", "CD14", "CD14", 
    "FCGR3A", "LYZ", "PPBP", 
    "CD8A", "PPBP", "CST3", 
    "NKG7", "MS4A7", "MS4A1", 
    "CD8A"
  );

> par(mfrow = c(3, 3));
> for(i in 1:9){
    j = which(colnames(x.sp@gmat) == marker.genes[i])
    plotFeatureSingle(
        obj=x.sp,
        feature.value=x.sp@gmat[,j],
        method="umap", 
        main=marker.genes[i],
        point.size=0.1, 
        point.shape=19, 
        down.sample=10000,
        quantiles=c(0.01, 0.99)
 )};
image.png

Step 13. Identify peaks

接下來(lái),我們將每個(gè)cluster群中的reads進(jìn)行聚合馁菜,創(chuàng)建用于peak calling和可視化的集成track茴扁。執(zhí)行完peak calling后,將生成一個(gè).narrowPeak的文件汪疮,該文件中包含了識(shí)別出的peaks信息峭火,和一個(gè).bedgraph的文件,可以用于可視化展示智嚷。為了獲得最robust的結(jié)果卖丸,我們不建議對(duì)細(xì)胞數(shù)目小于100的cluster群執(zhí)行此步驟。

> system("which snaptools");
/home/r3fang/anaconda2/bin/snaptools
> system("which macs2");
/home/r3fang/anaconda2/bin/macs2

# 使用runMACS函數(shù)調(diào)用macs2進(jìn)行peak calling
> peaks = runMACS(
    obj=x.sp[which(x.sp@cluster=="CD4 Naive"),], 
    output.prefix="PBMC.CD4_Naive",
    path.to.snaptools="/home/r3fang/anaconda2/bin/snaptools",
    path.to.macs="/home/r3fang/anaconda2/bin/macs2",
    gsize="hs", # mm, hs, etc
    buffer.size=500, 
    num.cores=10,
    macs.options="--nomodel --shift 100 --ext 200 --qval 5e-2 -B --SPMR",
    tmp.folder=tempdir()
    );

為了對(duì)所有的cluster群執(zhí)行peak calling盏道,我們提供了一個(gè)簡(jiǎn)便的腳本來(lái)實(shí)現(xiàn)該步驟稍浆。

# call peaks for all cluster with more than 100 cells
# 選出那些細(xì)胞數(shù)目大于100的cluster群
> clusters.sel = names(table(x.sp@cluster))[which(table(x.sp@cluster) > 100)];

# 批量對(duì)不同的cluster進(jìn)行peak calling
> peaks.ls = mclapply(seq(clusters.sel), function(i){
    print(clusters.sel[i]);
    peaks = runMACS(
        obj=x.sp[which(x.sp@cluster==clusters.sel[i]),], 
        output.prefix=paste0("PBMC.", gsub(" ", "_", clusters.sel)[i]),
        path.to.snaptools="/home/r3fang/anaconda2/bin/snaptools",
        path.to.macs="/home/r3fang/anaconda2/bin/macs2",
        gsize="hs", # mm, hs, etc
        buffer.size=500, 
        num.cores=1,
        macs.options="--nomodel --shift 100 --ext 200 --qval 5e-2 -B --SPMR",
        tmp.folder=tempdir()
   );
    peaks
    }, mc.cores=5);

> peaks.names = system("ls | grep narrowPeak", intern=TRUE);

> peak.gr.ls = lapply(peaks.names, function(x){
    peak.df = read.table(x)
    GRanges(peak.df[,1], IRanges(peak.df[,2], peak.df[,3]))
  })

# 合并peak信息
> peak.gr = reduce(Reduce(c, peak.gr.ls));

我們可以將每個(gè)cluster群生成的bdg文件導(dǎo)入到IGV或其他基因組瀏覽器(如UW genome browser)進(jìn)行可視化的展示,下面是來(lái)自UW genome browser的FOXJ2基因及其側(cè)翼區(qū)域的screenshot猜嘱。


image.png

Step 14. Create a cell-by-peak matrix

接下來(lái)衅枫,我們使用合并后的peaks作為參考,來(lái)創(chuàng)建一個(gè)cell-by-peak計(jì)數(shù)矩陣朗伶,并將其添加到snap對(duì)象中弦撩。
首先,我們將合并后的peaks信息寫(xiě)入到peaks.combined.bed文件中

> peaks.df = as.data.frame(peak.gr)[,1:3];
> write.table(peaks.df,file = "peaks.combined.bed",append=FALSE,
        quote= FALSE,sep="\t", eol = "\n", na = "NA", dec = ".", 
        row.names = FALSE, col.names = FALSE, qmethod = c("escape", "double"),
        fileEncoding = "")

接下來(lái)论皆,我們創(chuàng)建一個(gè)cell-by-peak計(jì)數(shù)矩陣益楼,并將其添加到snap對(duì)象中。這一步可能需要一段時(shí)間点晴。

$ snaptools snap-add-pmat \
    --snap-file atac_pbmc_10k_nextgem.snap \
    --peak-file peaks.combined.bed &
$ snaptools snap-add-pmat \
    --snap-file atac_pbmc_5k_nextgem.snap \
    --peak-file peaks.combined.bed  

然后感凤,我們將計(jì)算好的cell-by-peak計(jì)數(shù)矩陣添加到已有的snap對(duì)象中。

> x.sp = addPmatToSnap(x.sp);

Step 15. Identify differentially accessible peaks

對(duì)于一組給定的細(xì)胞Ci粒督,我們首先在diffusion component空間中尋找鄰近的細(xì)胞Cj (|Ci|=|Cj|)作為“background”細(xì)胞進(jìn)行比較陪竿。如果Ci細(xì)胞的數(shù)目占總細(xì)胞數(shù)目的一半以上,則使用剩余的細(xì)胞作為local background屠橄。接下來(lái)萨惑,我們將Ci和Cj細(xì)胞進(jìn)行聚合捐康,來(lái)創(chuàng)建兩個(gè)raw-count向量,即Vci和Vcj庸蔼。然后,我們使用R包edgeR (v3.18.1)的精確檢驗(yàn)(exact test)來(lái)對(duì)Vci和Vcj進(jìn)行差異分析贮匕,其中姐仅,對(duì)于小鼠設(shè)置BCV=0.1,而人類設(shè)置BCV= 0.4刻盐。最后掏膏,再使用Benjamini-Hochberg多重檢驗(yàn)校正方法,來(lái)調(diào)整p-value值為錯(cuò)誤發(fā)現(xiàn)率(FDR)值敦锌。對(duì)于FDR值小于0.05的peaks馒疹,被選為顯著的DARs。

我們發(fā)現(xiàn)這種方法也存在一定的局限性乙墙,特別是對(duì)于那些細(xì)胞數(shù)目較少的cluster群颖变,統(tǒng)計(jì)的有效性可能不夠強(qiáng)大。對(duì)于那些未能識(shí)別出顯著差異peaks的cluster群听想,我們將根據(jù)富集p-value值對(duì)元素進(jìn)行排序腥刹,并挑選出最具代表性的2,000個(gè)peaks進(jìn)行后續(xù)的motif分析。

> DARs = findDAR(
    obj=x.sp,
    input.mat="pmat",
    cluster.pos="CD14+ Monocytes",
    cluster.neg.method="knn",
    test.method="exactTest",
    bcv=0.4, #0.4 for human, 0.1 for mouse
    seed.use=10
  );
> DARs$FDR = p.adjust(DARs$PValue, method="BH");
> idy = which(DARs$FDR < 5e-2 & DARs$logFC > 0);

> par(mfrow = c(1, 2));
> plot(DARs$logCPM, DARs$logFC, 
    pch=19, cex=0.1, col="grey", 
    ylab="logFC", xlab="logCPM",
    main="CD14+ Monocytes"
  );
> points(DARs$logCPM[idy], 
    DARs$logFC[idy], 
    pch=19, 
    cex=0.5, 
    col="red"
  );
> abline(h = 0, lwd=1, lty=2);

> covs = Matrix::rowSums(x.sp@pmat);
> vals = Matrix::rowSums(x.sp@pmat[,idy]) / covs;
> vals.zscore = (vals - mean(vals)) / sd(vals);

> plotFeatureSingle(
    obj=x.sp,
    feature.value=vals.zscore,
    method="umap", 
    main="CD14+ Monocytes",
    point.size=0.1, 
    point.shape=19, 
    down.sample=5000,
    quantiles=c(0.01, 0.99)
  );

Step 16. Motif variability analysis

SnapATAC可以調(diào)用chromVAR(Schep et al)程序進(jìn)行motif可變性分析汉买。

# 加載所需的R包
> library(chromVAR);
> library(motifmatchr);
> library(SummarizedExperiment);
> library(BSgenome.Hsapiens.UCSC.hg19);

> x.sp = makeBinary(x.sp, "pmat");

# 使用runChromVAR函數(shù)調(diào)用ChromVAR進(jìn)行motif可變性分析
> x.sp@mmat = runChromVAR(
    obj=x.sp,
    input.mat="pmat",
    genome=BSgenome.Hsapiens.UCSC.hg19,
    min.count=10,
    species="Homo sapiens"
  );
> x.sp;
## number of barcodes: 13103
## number of bins: 627478
## number of genes: 19089
## number of peaks: 157750
## number of motifs: 271

> motif_i = "MA0071.1_RORA";
> dat = data.frame(x=x.sp@metaData$predicted.id, y=x.sp@mmat[,motif_i]);
> p <- ggplot(dat, aes(x=x, y=y, fill=x)) + 
    theme_classic() +
    geom_violin() + 
    xlab("cluster") +
    ylab("motif enrichment") + 
    ggtitle("MA0071.1_RORA") +
    theme(
          plot.margin = margin(5,1,5,1, "cm"),
          axis.text.x = element_text(angle = 90, hjust = 1),
          axis.ticks.x=element_blank(),
          legend.position = "none"
   );
image.png

Step 17. De novo motif discovery

SnapATAC還可以調(diào)用homer用于對(duì)識(shí)別出的差異可及性區(qū)域(DARs)進(jìn)行motif的識(shí)別和富集分析.

# 查看findMotifsGenome.pl程序安裝的路徑
> system("which findMotifsGenome.pl");
/projects/ps-renlab/r3fang/public_html/softwares/homer/bin/findMotifsGenome.pl

# 使用findDAR函數(shù)鑒定DAR區(qū)域
> DARs = findDAR(
    obj=x.sp,
    input.mat="pmat",
    cluster.pos="Double negative T cell",
    cluster.neg.method="knn",
    test.method="exactTest",
    bcv=0.4, #0.4 for human, 0.1 for mouse
    seed.use=10
  );
> DARs$FDR = p.adjust(DARs$PValue, method="BH");
> idy = which(DARs$FDR < 5e-2 & DARs$logFC > 0);

# 使用runHomer函數(shù)調(diào)用homer進(jìn)行de novo motif discovery
> motifs = runHomer(
    x.sp[,idy,"pmat"], 
    mat = "pmat",
    path.to.homer = "/projects/ps-renlab/r3fang/public_html/softwares/homer/bin/findMotifsGenome.pl",
    result.dir = "./homer/DoubleNegativeTcell",
    num.cores=5,
    genome = 'hg19',
    motif.length = 10,
    scan.size = 300,
    optimize.count = 2,
    background = 'automatic',
    local.background = FALSE,
    only.known = TRUE,
    only.denovo = FALSE,
    fdr.num = 5,
    cache = 100,
    overwrite = TRUE,
    keep.minimal = FALSE
    );
image.png

Step 18. Predict gene-enhancer pairs

最后衔峰,我們還可以使用一種“pseudo”細(xì)胞的方法,根據(jù)單個(gè)細(xì)胞中基因的表達(dá)和遠(yuǎn)端調(diào)控元件的染色質(zhì)可及性的關(guān)系蛙粘,將遠(yuǎn)端調(diào)控元件連接到目標(biāo)基因上垫卤。對(duì)于給定的marker基因,我們首先識(shí)別出marker基因側(cè)翼區(qū)域內(nèi)的peak出牧。對(duì)于每個(gè)側(cè)翼的peak穴肘,我們使用基因表達(dá)作為輸入變量進(jìn)行邏輯回歸來(lái)預(yù)測(cè)binarized的染色質(zhì)狀態(tài)。產(chǎn)生的結(jié)果估計(jì)了染色質(zhì)可及性與基因表達(dá)之間聯(lián)系的重要性崔列。

> TSS.loci = GRanges("chr12", IRanges(8219067, 8219068));

> pairs = predictGenePeakPair(
    x.sp, 
    input.mat="pmat",
    gene.name="C3AR1", 
    gene.loci=resize(TSS.loci, width=500000, fix="center"),
    do.par=FALSE
    );

# convert the pair to genome browser arc plot format
> pairs.df = as.data.frame(pairs);
> pairs.df = data.frame(
    chr1=pairs.df[,"seqnames"],
    start1=pairs.df[,"start"],
    end1=pairs.df[,"end"],
    chr2="chr2",
    start2=8219067,
    end2=8219068,
    Pval=pairs.df[,"logPval"]
    );
> head(pairs.df)
##    chr1  start1    end1 chr2  start2    end2       Pval
## 1 chr12 7984734 7985229 chr2 8219067 8219068 14.6075918
## 2 chr12 7987561 7988085 chr2 8219067 8219068  5.6718381
## 3 chr12 7989776 7990567 chr2 8219067 8219068 24.2564608
## 4 chr12 7996454 7996667 chr2 8219067 8219068  0.6411017
## 5 chr12 8000059 8000667 chr2 8219067 8219068  2.0324922
## 6 chr12 8012404 8013040 chr2 8219067 8219068  0.0000000
image.png

參考來(lái)源:https://gitee.com/booew/SnapATAC/blob/master/examples/10X_PBMC_15K/README.md

最后編輯于
?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請(qǐng)聯(lián)系作者
  • 序言:七十年代末梢褐,一起剝皮案震驚了整個(gè)濱河市,隨后出現(xiàn)的幾起案子赵讯,更是在濱河造成了極大的恐慌盈咳,老刑警劉巖,帶你破解...
    沈念sama閱讀 206,013評(píng)論 6 481
  • 序言:濱河連續(xù)發(fā)生了三起死亡事件边翼,死亡現(xiàn)場(chǎng)離奇詭異鱼响,居然都是意外死亡,警方通過(guò)查閱死者的電腦和手機(jī)组底,發(fā)現(xiàn)死者居然都...
    沈念sama閱讀 88,205評(píng)論 2 382
  • 文/潘曉璐 我一進(jìn)店門丈积,熙熙樓的掌柜王于貴愁眉苦臉地迎上來(lái)筐骇,“玉大人,你說(shuō)我怎么就攤上這事江滨☆跷常” “怎么了?”我有些...
    開(kāi)封第一講書(shū)人閱讀 152,370評(píng)論 0 342
  • 文/不壞的土叔 我叫張陵唬滑,是天一觀的道長(zhǎng)告唆。 經(jīng)常有香客問(wèn)我,道長(zhǎng)晶密,這世上最難降的妖魔是什么擒悬? 我笑而不...
    開(kāi)封第一講書(shū)人閱讀 55,168評(píng)論 1 278
  • 正文 為了忘掉前任,我火速辦了婚禮稻艰,結(jié)果婚禮上懂牧,老公的妹妹穿的比我還像新娘。我一直安慰自己尊勿,他們只是感情好僧凤,可當(dāng)我...
    茶點(diǎn)故事閱讀 64,153評(píng)論 5 371
  • 文/花漫 我一把揭開(kāi)白布。 她就那樣靜靜地躺著运怖,像睡著了一般拼弃。 火紅的嫁衣襯著肌膚如雪。 梳的紋絲不亂的頭發(fā)上摇展,一...
    開(kāi)封第一講書(shū)人閱讀 48,954評(píng)論 1 283
  • 那天吻氧,我揣著相機(jī)與錄音,去河邊找鬼咏连。 笑死盯孙,一個(gè)胖子當(dāng)著我的面吹牛,可吹牛的內(nèi)容都是我干的祟滴。 我是一名探鬼主播振惰,決...
    沈念sama閱讀 38,271評(píng)論 3 399
  • 文/蒼蘭香墨 我猛地睜開(kāi)眼,長(zhǎng)吁一口氣:“原來(lái)是場(chǎng)噩夢(mèng)啊……” “哼垄懂!你這毒婦竟也來(lái)了骑晶?” 一聲冷哼從身側(cè)響起,我...
    開(kāi)封第一講書(shū)人閱讀 36,916評(píng)論 0 259
  • 序言:老撾萬(wàn)榮一對(duì)情侶失蹤草慧,失蹤者是張志新(化名)和其女友劉穎桶蛔,沒(méi)想到半個(gè)月后,有當(dāng)?shù)厝嗽跇?shù)林里發(fā)現(xiàn)了一具尸體漫谷,經(jīng)...
    沈念sama閱讀 43,382評(píng)論 1 300
  • 正文 獨(dú)居荒郊野嶺守林人離奇死亡仔雷,尸身上長(zhǎng)有42處帶血的膿包…… 初始之章·張勛 以下內(nèi)容為張勛視角 年9月15日...
    茶點(diǎn)故事閱讀 35,877評(píng)論 2 323
  • 正文 我和宋清朗相戀三年,在試婚紗的時(shí)候發(fā)現(xiàn)自己被綠了。 大學(xué)時(shí)的朋友給我發(fā)了我未婚夫和他白月光在一起吃飯的照片碟婆。...
    茶點(diǎn)故事閱讀 37,989評(píng)論 1 333
  • 序言:一個(gè)原本活蹦亂跳的男人離奇死亡电抚,死狀恐怖,靈堂內(nèi)的尸體忽然破棺而出竖共,到底是詐尸還是另有隱情蝙叛,我是刑警寧澤,帶...
    沈念sama閱讀 33,624評(píng)論 4 322
  • 正文 年R本政府宣布公给,位于F島的核電站甥温,受9級(jí)特大地震影響,放射性物質(zhì)發(fā)生泄漏妓布。R本人自食惡果不足惜,卻給世界環(huán)境...
    茶點(diǎn)故事閱讀 39,209評(píng)論 3 307
  • 文/蒙蒙 一宋梧、第九天 我趴在偏房一處隱蔽的房頂上張望匣沼。 院中可真熱鬧,春花似錦捂龄、人聲如沸释涛。這莊子的主人今日做“春日...
    開(kāi)封第一講書(shū)人閱讀 30,199評(píng)論 0 19
  • 文/蒼蘭香墨 我抬頭看了看天上的太陽(yáng)唇撬。三九已至,卻和暖如春展融,著一層夾襖步出監(jiān)牢的瞬間窖认,已是汗流浹背。 一陣腳步聲響...
    開(kāi)封第一講書(shū)人閱讀 31,418評(píng)論 1 260
  • 我被黑心中介騙來(lái)泰國(guó)打工告希, 沒(méi)想到剛下飛機(jī)就差點(diǎn)兒被人妖公主榨干…… 1. 我叫王不留扑浸,地道東北人。 一個(gè)月前我還...
    沈念sama閱讀 45,401評(píng)論 2 352
  • 正文 我出身青樓燕偶,卻偏偏與公主長(zhǎng)得像喝噪,于是被迫代替她去往敵國(guó)和親。 傳聞我的和親對(duì)象是個(gè)殘疾皇子指么,可洞房花燭夜當(dāng)晚...
    茶點(diǎn)故事閱讀 42,700評(píng)論 2 345