簡(jiǎn)介
在本教程中,我們將對(duì)PBMC的兩個(gè)scATAC-seq數(shù)據(jù)集(5K和10K)和一個(gè)scRNA-seq數(shù)據(jù)集進(jìn)行整合分析吮廉。這三個(gè)數(shù)據(jù)集均來(lái)自10X genomics測(cè)序平臺(tái)產(chǎn)生的數(shù)據(jù),可以直接在10x官網(wǎng)下載使用仔燕。
具體來(lái)說(shuō),我們將主要執(zhí)行以下分析內(nèi)容:
- Cell selection for PBMC 5k and 10k scATAC;
- Randomly sample 10,000 cells as landmarks;
- Unsupervised clustering of landmarks;
- Project the remaining (query) cells onto the landmarks;
- Supervised annotation of scATAC clusters using scRNA dataset;
- Downstream analysis including peak calling, differential analysis, prediction of gene-enhancer pairing.
分析步驟
- Step 0. Data download
- Step 1. Barcode selection
- Step 2. Add cell-by-bin matrix
- Step 3. Matrix binarization
- Step 4. Bin filtering
- Step 5. Dimensionality reduction of landmarks
- Step 6. Determine significant components
- Step 7. Graph-based clustering
- Step 8. Visualization
- Step 9. scRNA-seq based annotation
- Step 10. Create psudo multiomics cells
- Step 11. Remove cells of low prediction score
- Step 12. Gene expression projected into UMAP
- Step 13. Identify peak
- Step 14. Create a cell-by-peak matrix
- Step 15. Identify differentially accessible regions
- Step 16. Motif variability analysis
- Step 17. De novo motif discovery
- Step 18. Predict gene-enhancer pairs
Step 0. Data Download
開(kāi)始分析之前竿裂,我們需要下載snaptools生成的snap文件和cellranger-atac產(chǎn)生的singlecell.csv文件矩父。
# 下載所需的數(shù)據(jù)集和基因注釋信息
$ wget http://renlab.sdsc.edu/r3fang//share/github/PBMC_ATAC_RNA/atac_pbmc_5k_nextgem.snap
$ wget http://cf.10xgenomics.com/samples/cell-atac/1.1.0/atac_pbmc_5k_nextgem/atac_pbmc_5k_nextgem_singlecell.csv
$ wget http://renlab.sdsc.edu/r3fang//share/github/PBMC_ATAC_RNA/atac_pbmc_10k_nextgem.snap
$ wget http://cf.10xgenomics.com/samples/cell-atac/1.1.0/atac_pbmc_10k_nextgem/atac_pbmc_10k_nextgem_singlecell.csv
$ wget http://cf.10xgenomics.com/samples/cell-atac/1.1.0/atac_pbmc_10k_nextgem/hg19.blacklist.bed.gz
$ wget http://cf.10xgenomics.com/samples/cell-atac/1.1.0/gencode.v19.annotation.gene.bed
Step 1. Barcode selection
首先没龙,我們根據(jù)以下兩個(gè)主要的標(biāo)準(zhǔn)來(lái)選擇高質(zhì)量的barcodes:
- number of unique fragments;
- fragments in promoter ratio;
# 加載所需R包
> library(SnapATAC);
> snap.files = c(
"atac_pbmc_5k_nextgem.snap",
"atac_pbmc_10k_nextgem.snap"
);
> sample.names = c(
"PBMC 5K",
"PBMC 10K"
);
> barcode.files = c(
"atac_pbmc_5k_nextgem_singlecell.csv",
"atac_pbmc_10k_nextgem_singlecell.csv"
);
# 讀取snap文件
> x.sp.ls = lapply(seq(snap.files), function(i){
createSnap(
file=snap.files[i],
sample=sample.names[i]
);
})
> names(x.sp.ls) = sample.names;
# 讀取barcode信息
> barcode.ls = lapply(seq(snap.files), function(i){
barcodes = read.csv(
barcode.files[i],
head=TRUE
);
# remove NO BAROCDE line
barcodes = barcodes[2:nrow(barcodes),];
barcodes$logUMI = log10(barcodes$passed_filters + 1);
barcodes$promoter_ratio = (barcodes$promoter_region_fragments+1) / (barcodes$passed_filters + 1);
barcodes
})
# 質(zhì)控指標(biāo)數(shù)據(jù)可視化
> plots = lapply(seq(snap.files), function(i){
p1 = ggplot(
barcode.ls[[i]],
aes(x=logUMI, y=promoter_ratio)) +
geom_point(size=0.3, col="grey") +
theme_classic() +
ggtitle(sample.names[[i]]) +
ylim(0, 1) + xlim(0, 6) +
labs(x = "log10(UMI)", y="promoter ratio")
p1
})
> plots
# 查看snap對(duì)象信息
> x.sp.ls
## $`PBMC 5K`
## number of barcodes: 20000
## number of bins: 0
## number of genes: 0
## number of peaks: 0
## number of motifs: 0
##
## $`PBMC 10K`
## number of barcodes: 20000
## number of bins: 0
## number of genes: 0
## number of peaks: 0
## number of motifs: 0
# for both datasets, we identify usable barcodes using [3.5-5] for log10(UMI) and [0.4-0.8] for promoter ratio as cutoff.
# 設(shè)定質(zhì)控閾值進(jìn)行篩選過(guò)濾
> cutoff.logUMI.low = c(3.5, 3.5);
> cutoff.logUMI.high = c(5, 5);
> cutoff.FRIP.low = c(0.4, 0.4);
> cutoff.FRIP.high = c(0.8, 0.8);
# 根據(jù)過(guò)濾的閾值進(jìn)行barcode的篩選
> barcode.ls = lapply(seq(snap.files), function(i){
barcodes = barcode.ls[[i]];
idx = which(
barcodes$logUMI >= cutoff.logUMI.low[i] &
barcodes$logUMI <= cutoff.logUMI.high[i] &
barcodes$promoter_ratio >= cutoff.FRIP.low[i] &
barcodes$promoter_ratio <= cutoff.FRIP.high[i]
);
barcodes[idx,]
});
> x.sp.ls = lapply(seq(snap.files), function(i){
barcodes = barcode.ls[[i]];
x.sp = x.sp.ls[[i]];
barcode.shared = intersect(x.sp@barcode, barcodes$barcode);
x.sp = x.sp[match(barcode.shared, x.sp@barcode),];
barcodes = barcodes[match(barcode.shared, barcodes$barcode),];
x.sp@metaData = barcodes;
x.sp
})
> names(x.sp.ls) = sample.names;
> x.sp.ls
## $`PBMC 5K`
## number of barcodes: 4526
## number of bins: 0
## number of genes: 0
## number of peaks: 0
## number of motifs: 0
##
## $`PBMC 10K`
## number of barcodes: 9039
## number of bins: 0
## number of genes: 0
## number of peaks: 0
## number of motifs: 0
# combine two snap object
# combine two snap object
# 使用Reduce函數(shù)將兩個(gè)snap對(duì)象進(jìn)行合并
> x.sp = Reduce(snapRbind, x.sp.ls);
> x.sp@metaData["sample"] = x.sp@sample;
> x.sp
## number of barcodes: 13565
## number of bins: 0
## number of genes: 0
## number of peaks: 0
> table(x.sp@sample);
## PBMC 10K PBMC 5K
## 9039 4526
Step 2. Add cell-by-bin matrix
接下來(lái)奢赂,我們使用addBmatToSnap函數(shù)生成5kb分辨率的cell-by-bin計(jì)數(shù)矩陣陪白,并將其添加到snap對(duì)象中。該函數(shù)將自動(dòng)從兩個(gè)snap文件中讀取cell-by-bin矩陣膳灶,并將連接后的矩陣添加到snap對(duì)象的bmat屬性中咱士。
# 使用addBmatToSnap函數(shù)計(jì)算cell-by-bin計(jì)數(shù)矩陣
> x.sp = addBmatToSnap(x.sp, bin.size=5000);
Step 3. Matrix binarization
我們使用makeBinary函數(shù)將cell-by-bin計(jì)數(shù)矩陣轉(zhuǎn)換為二進(jìn)制矩陣。在count矩陣中轧钓,某些項(xiàng)具有異常高的覆蓋率司致,這可能是由比對(duì)錯(cuò)誤造成的。因此聋迎,我們將刪除計(jì)數(shù)矩陣中top 0.1%的項(xiàng),并將其余非零的值轉(zhuǎn)換為1枣耀。
# 使用makeBinary函數(shù)將cell-by-bin計(jì)數(shù)矩陣轉(zhuǎn)換成二進(jìn)制矩陣
> x.sp = makeBinary(x.sp, mat="bmat");
Step 4. Bin filtering
首先霉晕,我們過(guò)濾掉任何與ENCODE中blacklist區(qū)域重疊的bins庭再,以防止?jié)撛诘腶rtifacts。
> library(GenomicRanges);
# 讀取blacklist信息
> black_list = read.table("hg19.blacklist.bed.gz");
> black_list.gr = GRanges(
black_list[,1],
IRanges(black_list[,2], black_list[,3])
);
> idy = queryHits(
findOverlaps(x.sp@feature, black_list.gr)
);
> if(length(idy) > 0){
x.sp = x.sp[,-idy, mat="bmat"];
};
> x.sp
## number of barcodes: 13565
## number of bins: 624794
## number of genes: 0
## number of peaks: 0
## number of motifs: 0
接下來(lái)牺堰,我們移除那些不需要的染色體信息拄轻。
> chr.exclude = seqlevels(x.sp@feature)[grep("random|chrM", seqlevels(x.sp@feature))];
> idy = grep(paste(chr.exclude, collapse="|"), x.sp@feature);
> if(length(idy) > 0){
x.sp = x.sp[,-idy, mat="bmat"]
};
> x.sp
## number of barcodes: 13565
## number of bins: 624297
## number of genes: 0
## number of peaks: 0
## number of motifs: 0
第三,bins的覆蓋率大致是服從對(duì)數(shù)正態(tài)分布的伟葫。我們將與不變特征(如管家基因的啟動(dòng)子)重疊的前5%的bins進(jìn)行刪除 恨搓。
> bin.cov = log10(Matrix::colSums(x.sp@bmat)+1);
> hist(
bin.cov[bin.cov > 0],
xlab="log10(bin cov)",
main="log10(Bin Cov)",
col="lightblue",
xlim=c(0, 5)
);
> bin.cutoff = quantile(bin.cov[bin.cov > 0], 0.95);
> idy = which(bin.cov <= bin.cutoff & bin.cov > 0);
> x.sp = x.sp[, idy, mat="bmat"];
> x.sp
## number of barcodes: 13565
## number of bins: 534985
## number of genes: 0
## number of peaks: 0
## number of motifs: 0
最后,我們將進(jìn)一步刪除bin覆蓋率小于1000的任何細(xì)胞筏养,這是因?yàn)楸M管有些細(xì)胞可能含有較高的unique fragments片段斧抱,但是過(guò)濾后的bin覆蓋率卻很低。此步驟是可選的渐溶,但強(qiáng)烈建議這樣做辉浦。
> idx = which(Matrix::rowSums(x.sp@bmat) > 1000);
> x.sp = x.sp[idx,];
> x.sp
## number of barcodes: 13434
## number of bins: 534985
## number of genes: 0
## number of peaks: 0
## number of motifs: 0
Step 5. Dimentionality reduction
SnapATAC采用diffusion maps算法進(jìn)行數(shù)據(jù)降維,這是一種非線性降維的技術(shù)茎辐,它通過(guò)對(duì)數(shù)據(jù)執(zhí)行random walk來(lái)發(fā)現(xiàn)低維的manifold宪郊,并且對(duì)噪音和擾動(dòng)具有很強(qiáng)的魯棒性。
但是拖陆,diffusion maps算法的計(jì)算成本會(huì)隨著細(xì)胞數(shù)目的增加而呈現(xiàn)指數(shù)級(jí)增長(zhǎng)的趨勢(shì)弛槐。為了克服這一局限,我們將Nystr?m方法(a sampling technique)和diffusion maps算法相結(jié)合依啰,給出Nystr?m Landmark diffusion map來(lái)生成大規(guī)模數(shù)據(jù)集的低維嵌入乎串。
Nystr?m landmark diffusion maps算法主要包括以下三個(gè)步驟:
-
sampling
: sample a subset of K (K?N) cells from N total cells as “l(fā)andmarks”. Instead of random sampling, here we adopted a density-based sampling approach developed in SCTransform to preserve the density distribution of the N original points; -
embedding
: compute a diffusion map embedding for K landmarks; -
extension
: project the remaining N-K cells onto the low-dimensional embedding as learned from the landmarks to create a joint embedding space for all cells.
在本示例中,我們將采樣10,000個(gè)細(xì)胞作為landmarks孔飒,并將余下的query細(xì)胞投射到嵌入landmarks的diffusion maps上灌闺。
> row.covs.dens <- density(
x = x.sp@metaData[,"logUMI"],
bw = 'nrd', adjust = 1
);
> sampling_prob <- 1 / (approx(x = row.covs.dens$x, y = row.covs.dens$y, xout = x.sp@metaData[,"logUMI"])$y + .Machine$double.eps);
> set.seed(1);
> idx.landmark.ds <- sort(sample(x = seq(nrow(x.sp)), size = 10000, prob = sampling_prob));
將x.sp對(duì)象分割為landmark(x.landmark.sp)細(xì)胞和query(x.query.sp)細(xì)胞。
> x.landmark.sp = x.sp[idx.landmark.ds,];
> x.query.sp = x.sp[-idx.landmark.ds,];
Run diffusion maps on the landmark cells.
使用landmark細(xì)胞進(jìn)行diffusion maps降維處理
> x.landmark.sp = runDiffusionMaps(
obj= x.landmark.sp,
input.mat="bmat",
num.eigs=50
);
> x.landmark.sp@metaData$landmark = 1;
Porject query cells to landmarks.
將query細(xì)胞映射到landmarks上
> x.query.sp = runDiffusionMapsExtension(
obj1=x.landmark.sp,
obj2=x.query.sp,
input.mat="bmat"
);
> x.query.sp@metaData$landmark = 0;
Combine landmark and query cells.
合并landmark和query細(xì)胞
Note: To merge snap objects, all the matrix (bmat, gmat, pmat) and metaData must be of the same number of columns between snap objects.
> x.sp = snapRbind(x.landmark.sp, x.query.sp);
> x.sp = x.sp[order(x.sp@metaData[,"sample"])]; #IMPORTANT
Step 6. Determine significant components
接下來(lái)坏瞄,我們將確定特征向量(eigen-vectors)的數(shù)目桂对,以便用于后續(xù)的分析。在下面的例子中鸠匀,我們選擇前15個(gè)特征向量蕉斜。
> plotDimReductPW(
obj=x.sp,
eigs.dims=1:50,
point.size=0.3,
point.color="grey",
point.shape=19,
point.alpha=0.6,
down.sample=5000,
pdf.file.name=NULL,
pdf.height=7,
pdf.width=7
);
Step 7. Graph-based clustering
接下來(lái),我們使用所選的特征向量成分缀棍,來(lái)構(gòu)造K近鄰(KNN)聚類圖宅此。其中,每個(gè)節(jié)點(diǎn)是一個(gè)細(xì)胞爬范,根據(jù)歐氏距離來(lái)確定每個(gè)細(xì)胞的k個(gè)近鄰細(xì)胞父腕。
> x.sp = runKNN(
obj=x.sp,
eigs.dims=1:20,
k=15
);
> library(leiden);
> x.sp=runCluster(
obj=x.sp,
tmp.folder=tempdir(),
louvain.lib="leiden",
seed.use=10,
resolution=0.7
);
Step 8. Visualization
SnapATAC可以使用tSNE(FI-tsne)和UMAP方法對(duì)降維聚類后的細(xì)胞進(jìn)行可視化的展示和探索。在本示例中青瀑,我們使用UMAP方法進(jìn)行展示璧亮。
> library(umap);
> x.sp = runViz(
obj=x.sp,
tmp.folder=tempdir(),
dims=2,
eigs.dims=1:20,
method="umap",
seed.use=10
);
> par(mfrow = c(2, 2));
> plotViz(
obj= x.sp,
method="umap",
main="Cluster",
point.color=x.sp@cluster,
point.size=0.2,
point.shape=19,
text.add=TRUE,
text.size=1,
text.color="black",
down.sample=10000,
legend.add=FALSE
);
> plotFeatureSingle(
obj=x.sp,
feature.value=x.sp@metaData[,"logUMI"],
method="umap",
main="Read Depth",
point.size=0.2,
point.shape=19,
down.sample=10000,
quantiles=c(0.01, 0.99)
);
> plotViz(
obj= x.sp,
method="umap",
main="Sample",
point.size=0.2,
point.shape=19,
point.color=x.sp@sample,
text.add=FALSE,
text.size=1.5,
text.color="black",
down.sample=10000,
legend.add=TRUE
);
> plotViz(
obj= x.sp,
method="umap",
main="Landmark",
point.size=0.2,
point.shape=19,
point.color=x.sp@metaData[,"landmark"],
text.add=FALSE,
text.size=1.5,
text.color="black",
down.sample=10000,
legend.add=TRUE
);
Step 9. scRNA-seq based annotation
在本示例中萧诫,我們將使用相應(yīng)的scRNA-seq數(shù)據(jù)集來(lái)對(duì)scATAC-seq數(shù)據(jù)的細(xì)胞類群進(jìn)行注釋。我們可以通過(guò)(https://www.dropbox.com/s/3f3p5nxrn5b3y4y/pbmc_10k_v3.rds?dl=1)地址下載所需的10X PBMC scRNA-seq(pbmc_10k_v3.rds)的Seurat對(duì)象枝嘶。
> library(Seurat);
> pbmc.rna = readRDS("pbmc_10k_v3.rds");
> pbmc.rna$tech = "rna";
> variable.genes = VariableFeatures(object = pbmc.rna);
> genes.df = read.table("gencode.v19.annotation.gene.bed");
> genes.gr = GRanges(genes.df[,1], IRanges(genes.df[,2], genes.df[,3]), name=genes.df[,4]);
> genes.sel.gr = genes.gr[which(genes.gr$name %in% variable.genes)];
## reload the bmat, this is optional but highly recommanded
> x.sp = addBmatToSnap(x.sp);
> x.sp = createGmatFromMat(
obj=x.sp,
input.mat="bmat",
genes=genes.sel.gr,
do.par=TRUE,
num.cores=10
);
接下來(lái)帘饶,我們將snap對(duì)象轉(zhuǎn)換為Seurat對(duì)象,用于后續(xù)的數(shù)據(jù)整合群扶。
# 使用snapToSeurat函數(shù)將snap對(duì)象轉(zhuǎn)換為Seurat對(duì)象
> pbmc.atac <- snapToSeurat(
obj=x.sp,
eigs.dims=1:20,
norm=TRUE,
scale=TRUE
);
# 識(shí)別整合的Anchors信息
> transfer.anchors <- FindTransferAnchors(
reference = pbmc.rna,
query = pbmc.atac,
features = variable.genes,
reference.assay = "RNA",
query.assay = "ACTIVITY",
reduction = "cca"
);
# 使用TransferData函數(shù)進(jìn)行數(shù)據(jù)映射整合
> celltype.predictions <- TransferData(
anchorset = transfer.anchors,
refdata = pbmc.rna$celltype,
weight.reduction = pbmc.atac[["SnapATAC"]],
dims = 1:20
);
> x.sp@metaData$predicted.id = celltype.predictions$predicted.id;
> x.sp@metaData$predict.max.score = apply(celltype.predictions[,-1], 1, max);
> x.sp@cluster = as.factor(x.sp@metaData$predicted.id);
Step 10. Create psudo multiomics cells
現(xiàn)在及刻,snap對(duì)象x.sp中包含了的每個(gè)細(xì)胞的染色質(zhì)可及性@bmat和基因表達(dá)@gmat的信息。
> refdata <- GetAssayData(
object = pbmc.rna,
assay = "RNA",
slot = "data"
);
> imputation <- TransferData(
anchorset = transfer.anchors,
refdata = refdata,
weight.reduction = pbmc.atac[["SnapATAC"]],
dims = 1:20
);
> x.sp@gmat = t(imputation@data);
> rm(imputation); # free memory
> rm(refdata); # free memory
> rm(pbmc.rna); # free memory
> rm(pbmc.atac); # free memory
Step 11. Remove cells of low prediction score
> hist(
x.sp@metaData$predict.max.score,
xlab="prediction score",
col="lightblue",
xlim=c(0, 1),
main="PBMC 10X"
);
> abline(v=0.5, col="red", lwd=2, lty=2);
> table(x.sp@metaData$predict.max.score > 0.5);
## FALSE TRUE
## 331 13103
> x.sp = x.sp[x.sp@metaData$predict.max.score > 0.5,];
> x.sp
## number of barcodes: 13045
## number of bins: 627478
## number of genes: 19089
## number of peaks: 0
> plotViz(
obj=x.sp,
method="umap",
main="PBMC 10X",
point.color=x.sp@metaData[,"predicted.id"],
point.size=0.5,
point.shape=19,
text.add=TRUE,
text.size=1,
text.color="black",
down.sample=10000,
legend.add=FALSE
);
Step 12. Gene expression projected onto UMAP
接下來(lái)竞阐,我們使用UMAP方法對(duì)一些marker基因的表達(dá)水平進(jìn)行可視化展示缴饭。
> marker.genes = c(
"IL32", "LTB", "CD3D",
"IL7R", "LDHB", "FCGR3A",
"CD68", "MS4A1", "GNLY",
"CD3E", "CD14", "CD14",
"FCGR3A", "LYZ", "PPBP",
"CD8A", "PPBP", "CST3",
"NKG7", "MS4A7", "MS4A1",
"CD8A"
);
> par(mfrow = c(3, 3));
> for(i in 1:9){
j = which(colnames(x.sp@gmat) == marker.genes[i])
plotFeatureSingle(
obj=x.sp,
feature.value=x.sp@gmat[,j],
method="umap",
main=marker.genes[i],
point.size=0.1,
point.shape=19,
down.sample=10000,
quantiles=c(0.01, 0.99)
)};
Step 13. Identify peaks
接下來(lái),我們將每個(gè)cluster群中的reads進(jìn)行聚合馁菜,創(chuàng)建用于peak calling和可視化的集成track茴扁。執(zhí)行完peak calling后,將生成一個(gè).narrowPeak的文件汪疮,該文件中包含了識(shí)別出的peaks信息峭火,和一個(gè).bedgraph的文件,可以用于可視化展示智嚷。為了獲得最robust的結(jié)果卖丸,我們不建議對(duì)細(xì)胞數(shù)目小于100的cluster群執(zhí)行此步驟。
> system("which snaptools");
/home/r3fang/anaconda2/bin/snaptools
> system("which macs2");
/home/r3fang/anaconda2/bin/macs2
# 使用runMACS函數(shù)調(diào)用macs2進(jìn)行peak calling
> peaks = runMACS(
obj=x.sp[which(x.sp@cluster=="CD4 Naive"),],
output.prefix="PBMC.CD4_Naive",
path.to.snaptools="/home/r3fang/anaconda2/bin/snaptools",
path.to.macs="/home/r3fang/anaconda2/bin/macs2",
gsize="hs", # mm, hs, etc
buffer.size=500,
num.cores=10,
macs.options="--nomodel --shift 100 --ext 200 --qval 5e-2 -B --SPMR",
tmp.folder=tempdir()
);
為了對(duì)所有的cluster群執(zhí)行peak calling盏道,我們提供了一個(gè)簡(jiǎn)便的腳本來(lái)實(shí)現(xiàn)該步驟稍浆。
# call peaks for all cluster with more than 100 cells
# 選出那些細(xì)胞數(shù)目大于100的cluster群
> clusters.sel = names(table(x.sp@cluster))[which(table(x.sp@cluster) > 100)];
# 批量對(duì)不同的cluster進(jìn)行peak calling
> peaks.ls = mclapply(seq(clusters.sel), function(i){
print(clusters.sel[i]);
peaks = runMACS(
obj=x.sp[which(x.sp@cluster==clusters.sel[i]),],
output.prefix=paste0("PBMC.", gsub(" ", "_", clusters.sel)[i]),
path.to.snaptools="/home/r3fang/anaconda2/bin/snaptools",
path.to.macs="/home/r3fang/anaconda2/bin/macs2",
gsize="hs", # mm, hs, etc
buffer.size=500,
num.cores=1,
macs.options="--nomodel --shift 100 --ext 200 --qval 5e-2 -B --SPMR",
tmp.folder=tempdir()
);
peaks
}, mc.cores=5);
> peaks.names = system("ls | grep narrowPeak", intern=TRUE);
> peak.gr.ls = lapply(peaks.names, function(x){
peak.df = read.table(x)
GRanges(peak.df[,1], IRanges(peak.df[,2], peak.df[,3]))
})
# 合并peak信息
> peak.gr = reduce(Reduce(c, peak.gr.ls));
我們可以將每個(gè)cluster群生成的bdg文件導(dǎo)入到IGV或其他基因組瀏覽器(如UW genome browser)進(jìn)行可視化的展示,下面是來(lái)自UW genome browser的FOXJ2基因及其側(cè)翼區(qū)域的screenshot猜嘱。
Step 14. Create a cell-by-peak matrix
接下來(lái)衅枫,我們使用合并后的peaks作為參考,來(lái)創(chuàng)建一個(gè)cell-by-peak計(jì)數(shù)矩陣朗伶,并將其添加到snap對(duì)象中弦撩。
首先,我們將合并后的peaks信息寫(xiě)入到peaks.combined.bed文件中
> peaks.df = as.data.frame(peak.gr)[,1:3];
> write.table(peaks.df,file = "peaks.combined.bed",append=FALSE,
quote= FALSE,sep="\t", eol = "\n", na = "NA", dec = ".",
row.names = FALSE, col.names = FALSE, qmethod = c("escape", "double"),
fileEncoding = "")
接下來(lái)论皆,我們創(chuàng)建一個(gè)cell-by-peak計(jì)數(shù)矩陣益楼,并將其添加到snap對(duì)象中。這一步可能需要一段時(shí)間点晴。
$ snaptools snap-add-pmat \
--snap-file atac_pbmc_10k_nextgem.snap \
--peak-file peaks.combined.bed &
$ snaptools snap-add-pmat \
--snap-file atac_pbmc_5k_nextgem.snap \
--peak-file peaks.combined.bed
然后感凤,我們將計(jì)算好的cell-by-peak計(jì)數(shù)矩陣添加到已有的snap對(duì)象中。
> x.sp = addPmatToSnap(x.sp);
Step 15. Identify differentially accessible peaks
對(duì)于一組給定的細(xì)胞Ci粒督,我們首先在diffusion component空間中尋找鄰近的細(xì)胞Cj (|Ci|=|Cj|)作為“background”細(xì)胞進(jìn)行比較陪竿。如果Ci細(xì)胞的數(shù)目占總細(xì)胞數(shù)目的一半以上,則使用剩余的細(xì)胞作為local background屠橄。接下來(lái)萨惑,我們將Ci和Cj細(xì)胞進(jìn)行聚合捐康,來(lái)創(chuàng)建兩個(gè)raw-count向量,即Vci和Vcj庸蔼。然后,我們使用R包edgeR (v3.18.1)的精確檢驗(yàn)(exact test)來(lái)對(duì)Vci和Vcj進(jìn)行差異分析贮匕,其中姐仅,對(duì)于小鼠設(shè)置BCV=0.1,而人類設(shè)置BCV= 0.4刻盐。最后掏膏,再使用Benjamini-Hochberg多重檢驗(yàn)校正方法,來(lái)調(diào)整p-value值為錯(cuò)誤發(fā)現(xiàn)率(FDR)值敦锌。對(duì)于FDR值小于0.05的peaks馒疹,被選為顯著的DARs。
我們發(fā)現(xiàn)這種方法也存在一定的局限性乙墙,特別是對(duì)于那些細(xì)胞數(shù)目較少的cluster群颖变,統(tǒng)計(jì)的有效性可能不夠強(qiáng)大。對(duì)于那些未能識(shí)別出顯著差異peaks的cluster群听想,我們將根據(jù)富集p-value值對(duì)元素進(jìn)行排序腥刹,并挑選出最具代表性的2,000個(gè)peaks進(jìn)行后續(xù)的motif分析。
> DARs = findDAR(
obj=x.sp,
input.mat="pmat",
cluster.pos="CD14+ Monocytes",
cluster.neg.method="knn",
test.method="exactTest",
bcv=0.4, #0.4 for human, 0.1 for mouse
seed.use=10
);
> DARs$FDR = p.adjust(DARs$PValue, method="BH");
> idy = which(DARs$FDR < 5e-2 & DARs$logFC > 0);
> par(mfrow = c(1, 2));
> plot(DARs$logCPM, DARs$logFC,
pch=19, cex=0.1, col="grey",
ylab="logFC", xlab="logCPM",
main="CD14+ Monocytes"
);
> points(DARs$logCPM[idy],
DARs$logFC[idy],
pch=19,
cex=0.5,
col="red"
);
> abline(h = 0, lwd=1, lty=2);
> covs = Matrix::rowSums(x.sp@pmat);
> vals = Matrix::rowSums(x.sp@pmat[,idy]) / covs;
> vals.zscore = (vals - mean(vals)) / sd(vals);
> plotFeatureSingle(
obj=x.sp,
feature.value=vals.zscore,
method="umap",
main="CD14+ Monocytes",
point.size=0.1,
point.shape=19,
down.sample=5000,
quantiles=c(0.01, 0.99)
);
Step 16. Motif variability analysis
SnapATAC可以調(diào)用chromVAR(Schep et al)程序進(jìn)行motif可變性分析汉买。
# 加載所需的R包
> library(chromVAR);
> library(motifmatchr);
> library(SummarizedExperiment);
> library(BSgenome.Hsapiens.UCSC.hg19);
> x.sp = makeBinary(x.sp, "pmat");
# 使用runChromVAR函數(shù)調(diào)用ChromVAR進(jìn)行motif可變性分析
> x.sp@mmat = runChromVAR(
obj=x.sp,
input.mat="pmat",
genome=BSgenome.Hsapiens.UCSC.hg19,
min.count=10,
species="Homo sapiens"
);
> x.sp;
## number of barcodes: 13103
## number of bins: 627478
## number of genes: 19089
## number of peaks: 157750
## number of motifs: 271
> motif_i = "MA0071.1_RORA";
> dat = data.frame(x=x.sp@metaData$predicted.id, y=x.sp@mmat[,motif_i]);
> p <- ggplot(dat, aes(x=x, y=y, fill=x)) +
theme_classic() +
geom_violin() +
xlab("cluster") +
ylab("motif enrichment") +
ggtitle("MA0071.1_RORA") +
theme(
plot.margin = margin(5,1,5,1, "cm"),
axis.text.x = element_text(angle = 90, hjust = 1),
axis.ticks.x=element_blank(),
legend.position = "none"
);
Step 17. De novo motif discovery
SnapATAC還可以調(diào)用homer用于對(duì)識(shí)別出的差異可及性區(qū)域(DARs)進(jìn)行motif的識(shí)別和富集分析.
# 查看findMotifsGenome.pl程序安裝的路徑
> system("which findMotifsGenome.pl");
/projects/ps-renlab/r3fang/public_html/softwares/homer/bin/findMotifsGenome.pl
# 使用findDAR函數(shù)鑒定DAR區(qū)域
> DARs = findDAR(
obj=x.sp,
input.mat="pmat",
cluster.pos="Double negative T cell",
cluster.neg.method="knn",
test.method="exactTest",
bcv=0.4, #0.4 for human, 0.1 for mouse
seed.use=10
);
> DARs$FDR = p.adjust(DARs$PValue, method="BH");
> idy = which(DARs$FDR < 5e-2 & DARs$logFC > 0);
# 使用runHomer函數(shù)調(diào)用homer進(jìn)行de novo motif discovery
> motifs = runHomer(
x.sp[,idy,"pmat"],
mat = "pmat",
path.to.homer = "/projects/ps-renlab/r3fang/public_html/softwares/homer/bin/findMotifsGenome.pl",
result.dir = "./homer/DoubleNegativeTcell",
num.cores=5,
genome = 'hg19',
motif.length = 10,
scan.size = 300,
optimize.count = 2,
background = 'automatic',
local.background = FALSE,
only.known = TRUE,
only.denovo = FALSE,
fdr.num = 5,
cache = 100,
overwrite = TRUE,
keep.minimal = FALSE
);
Step 18. Predict gene-enhancer pairs
最后衔峰,我們還可以使用一種“pseudo”細(xì)胞的方法,根據(jù)單個(gè)細(xì)胞中基因的表達(dá)和遠(yuǎn)端調(diào)控元件的染色質(zhì)可及性的關(guān)系蛙粘,將遠(yuǎn)端調(diào)控元件連接到目標(biāo)基因上垫卤。對(duì)于給定的marker基因,我們首先識(shí)別出marker基因側(cè)翼區(qū)域內(nèi)的peak出牧。對(duì)于每個(gè)側(cè)翼的peak穴肘,我們使用基因表達(dá)作為輸入變量進(jìn)行邏輯回歸來(lái)預(yù)測(cè)binarized的染色質(zhì)狀態(tài)。產(chǎn)生的結(jié)果估計(jì)了染色質(zhì)可及性與基因表達(dá)之間聯(lián)系的重要性崔列。
> TSS.loci = GRanges("chr12", IRanges(8219067, 8219068));
> pairs = predictGenePeakPair(
x.sp,
input.mat="pmat",
gene.name="C3AR1",
gene.loci=resize(TSS.loci, width=500000, fix="center"),
do.par=FALSE
);
# convert the pair to genome browser arc plot format
> pairs.df = as.data.frame(pairs);
> pairs.df = data.frame(
chr1=pairs.df[,"seqnames"],
start1=pairs.df[,"start"],
end1=pairs.df[,"end"],
chr2="chr2",
start2=8219067,
end2=8219068,
Pval=pairs.df[,"logPval"]
);
> head(pairs.df)
## chr1 start1 end1 chr2 start2 end2 Pval
## 1 chr12 7984734 7985229 chr2 8219067 8219068 14.6075918
## 2 chr12 7987561 7988085 chr2 8219067 8219068 5.6718381
## 3 chr12 7989776 7990567 chr2 8219067 8219068 24.2564608
## 4 chr12 7996454 7996667 chr2 8219067 8219068 0.6411017
## 5 chr12 8000059 8000667 chr2 8219067 8219068 2.0324922
## 6 chr12 8012404 8013040 chr2 8219067 8219068 0.0000000
參考來(lái)源:https://gitee.com/booew/SnapATAC/blob/master/examples/10X_PBMC_15K/README.md