hello莲兢,大家好汹来,禮拜天了,明天又要上班了改艇,但是很想寫一點(diǎn)東西收班,留下點(diǎn)足跡,但是又不想太動(dòng)腦子谒兄,所以分享一篇代碼文摔桦,分享給大家,單細(xì)胞轉(zhuǎn)錄組和ATAC的聯(lián)合分析。
Joint definition of cell types from single-cell gene expression and chromatin accessibility data
說(shuō)起liger邻耕,大家應(yīng)該不陌生鸥咖,大家可以參考我的文章10X單細(xì)胞(空間轉(zhuǎn)錄組)數(shù)據(jù)整合分析批次矯正之liger,算法不錯(cuò)兄世,核心就是NMF啼辣,關(guān)于NMF的運(yùn)用,我也分享了大量的文章御滩,大家可以參考
聯(lián)合分析 scRNA-seq 和 scATAC-seq 的流程類似于整合多個(gè) scRNA-seq 數(shù)據(jù)集的流程鸥拧,都依賴于聯(lián)合矩陣分解和分位數(shù)歸一化。 主要區(qū)別在于:(1)scATAC-seq數(shù)據(jù)需要處理成基因級(jí)別的值削解; (2) 僅對(duì) scRNA-seq 數(shù)據(jù)進(jìn)行基因選擇富弦; (3) 下游分析可以使用基因水平和基因間信息。
Stage I: Preprocessing and Normalization
為了聯(lián)合分析 scRNA 和 scATAC-seq 數(shù)據(jù)氛驮,首先需要將 scATAC-seq 數(shù)據(jù)(一種全基因組表觀基因組測(cè)量)轉(zhuǎn)換為與來(lái)自 scRNA-seq 的基因表達(dá)數(shù)據(jù)相當(dāng)?shù)幕蛩接?jì)數(shù)腕柜。大多數(shù)以前的單細(xì)胞研究都使用了一種受傳統(tǒng)bulk ATAC-seq 分析啟發(fā)的方法:識(shí)別染色質(zhì)可及峰,然后將與每個(gè)基因重疊的所有峰相加柳爽。這種策略也很有吸引力媳握,因?yàn)?10X CellRanger pipeline是一種常用的商業(yè)軟件包碱屁,可以自動(dòng)輸出這樣的峰值計(jì)數(shù)磷脯。然而,分析發(fā)現(xiàn)這種策略不太理想娩脾,因?yàn)椋?strong>(1)使用所有細(xì)胞執(zhí)行峰值調(diào)用赵誓,這會(huì)偏向稀有細(xì)胞群; (2) 基因可及性通常比特定調(diào)控元件更分散柿赊,因此可能會(huì)被峰值調(diào)用算法遺漏俩功; (3) 峰值之外的讀數(shù)信息被丟棄,進(jìn)一步減少了已經(jīng)稀疏的測(cè)量中的數(shù)據(jù)量碰声。發(fā)現(xiàn)最簡(jiǎn)單的策略似乎效果很好诡蜓,而不是求和峰值計(jì)數(shù):計(jì)算每個(gè)細(xì)胞中每個(gè)基因的基因體和啟動(dòng)子區(qū)域(通常為上游 3 kb)內(nèi)的 ATAC-seq 讀數(shù)的總數(shù)。
加載數(shù)據(jù)
library(liger)
D5T1 <- readRDS('~/Downloads/GSM4138888_scATAC_BMMC_D5T1.RDS')
rna1 <- readRDS('~/Downloads/GSM4138872_scRNA_BMMC_D1T1.rds')
rna2 <- readRDS('~/Downloads/GSM4138873_scRNA_BMMC_D1T2.rds')
bmmc.rna <- cbind(rna1,rna2)
rm(rna1, rna2)
必須首先使用 sort 命令行實(shí)用程序按染色體胰挑、開(kāi)始和結(jié)束位置對(duì) Fragments.tsv 進(jìn)行排序蔓罚。 -k 選項(xiàng)讓用戶在某個(gè)列上對(duì)文件進(jìn)行排序;包括多個(gè) -k 選項(xiàng)允許同時(shí)按多列排序瞻颂。 -k 后面的 n 代表“數(shù)字排序”豺谈。這里排序的 .bed 文件順序首先由字典染色體順序定義(使用參數(shù) -k1,1),然后是升序整數(shù)起始坐標(biāo)順序(使用參數(shù) -k2,2n)贡这,最后是升序整數(shù)結(jié)束坐標(biāo)順序(使用參數(shù)-k3,3n)茬末。
sort -k1,1 -k2,2n -k3,3n GSM4138888_scATAC_BMMC_D5T1.fragments.tsv > atac_fragments.sort.bed
sort -k 1,1 -k2,2n -k3,3n hg19_genes.bed > hg19_genes.sort.bed
sort -k 1,1 -k2,2n -k3,3n hg19_promoters.bed > hg19_promoters.sort.bed
bedmap --ec --delim "\t" --echo --echo-map-id hg19_promoters.sort.bed atac_fragments.sort.bed > atac_promoters_bc.bed
bedmap --ec --delim "\t" --echo --echo-map-id hg19_genes.sort.bed atac_fragments.sort.bed > atac_genes_bc.bed
Important flags of bedmap command are as follows:
- –delim. This changes output delimiter from ‘|’ to indicated delimiter between columns, which in our case is “\t”.
- –ec. Adding this will check all problematic input files.
- –echo. Adding this will print each line from reference file in output. The reference file in our case is gene or promoter index.
- –echo-map-id. Adding this will list IDs of all overlapping elements from mapping files, which in our case are cell barcodes from fragment files.
- We then import the bedmap outputs into the R Console or RStudio. Note that the as.is option in read.table is specified to prevent the conversion of character columns to factor columns:
genes.bc <- read.table(file = "atac_genes_bc.bed", sep = "\t", as.is = c(4,7), header = FALSE)
promoters.bc <- read.table(file = "atac_promoters_bc.bed", sep = "\t", as.is = c(4,7), header = FALSE)
bc <- genes.bc[,7]
bc_split <- strsplit(bc,";")
bc_split_vec <- unlist(bc_split)
bc_unique <- unique(bc_split_vec)
bc_counts <- table(bc_split_vec)
bc_filt <- names(bc_counts)[bc_counts > 1500]
barcodes <- bc_filt
We can then use LIGER’s makeFeatureMatrix function to calculate accessibility counts for gene body and promoter individually. This function takes the output from bedmap and efficiently counts the number of fragments overlapping each gene and promoter. We could count the genes and promoters in a single step, but choose to calculate them separately in case it is necessary to look at gene or promoter accessibility individually in downstream analyses.
library(liger)
gene.counts <- makeFeatureMatrix(genes.bc, barcodes)
promoter.counts <- makeFeatureMatrix(promoters.bc, barcodes)
Next, these two count matrices need to be re-sorted by gene symbol. We then add the matrices together, yielding a single matrix of gene accessibility counts in each cell.
gene.counts <- gene.counts[order(rownames(gene.counts)),]
promoter.counts <- promoter.counts[order(rownames(promoter.counts)),]
D5T1 <- gene.counts + promoter.counts
colnames(D5T1)=paste0("D5T1_",colnames(D5T1))
5. Once the gene-level scATAC-seq counts are generated, the read10X function from LIGER can be used to read scRNA-seq count matrices output by CellRanger. You can pass in a directory (or a list of directories) containing raw outputs (for example, “/Sample_1/outs/filtered_feature_bc_matrix”) to the parameter sample.dirs. Next, a vector of names to use for the sample (or samples, corresponding to sample.dirs) should be passed to parameter sample.names as well. LIGER can also use data from any other protocol, as long as it is provided in a genes x cells R matrix format.
bmmc.rna <- read10X(sample.dirs = list("/path_to_sample"), sample.names = list("rna"))
6. We can now create a LIGER object with the createLiger function. We also remove unneeded variables to conserve memory.
bmmc.data <- list(atac = D5T1, rna = bmmc.rna)
int.bmmc <- createLiger(bmmc.data)
rm(genes.bc, promoters.bc, gene.counts, promoter.counts, D5T1, bmmc.rna, bmmc.data)
gc()
7. Preprocessing steps are needed before running iNMF. Each dataset is normalized to account for differences in total gene-level counts across cells using the normalize function. Next, highly variable genes from each dataset are identified and combined for use in downstream analysis. Note that by setting the parameter datasets.use to 2, genes will be selected only from the scRNA-seq dataset (the second dataset) by the selectGenes function. We recommend not using the ATAC-seq data for variable gene selection because the statistical properties of the ATAC-seq data are very different from scRNA-seq, violating the assumptions made by the statistical model we developed for selecting genes from RNA data. Finally, the scaleNotCenter function scales normalized datasets without centering by the mean, giving the nonnegative input data required by iNMF.
int.bmmc <- normalize(int.bmmc)
int.bmmc <- selectGenes(int.bmmc, datasets.use = 2)
int.bmmc <- scaleNotCenter(int.bmmc)
Stage II: Joint Matrix Factorization (3 - 10 minutes)
8. We next perform joint matrix factorization (iNMF) on the normalized and scaled RNA and ATAC data. This step calculates metagenes–sets of co-expressed genes that distinguish cell populations–containing both shared and dataset-specific signals. The cells are then represented in terms of the “expression level” of each metagene, providing a low-dimensional representation that can be used for joint clustering and visualization. To run iNMF on the scaled datasets, we use the optimizeALS function with proper hyperparameter settings.
To run iNMF on the scaled datasets, use optimizeALS function with proper hyperparameters setting:
int.bmmc <- optimizeALS(int.bmmc, k = 20)
Important parameters are as follows:
- k. Integer value specifying the inner dimension of factorization, or number of factors. Higher k is recommended for datasets with more substructure. We find that a value of k in the range 20 - 40 works well for most datasets. Because this is an unsupervised, exploratory analysis, there is no single “right” value for k, and in practice, users choose k from a combination of biological prior knowledge and other information.
- lambda. This is a regularization parameter. Larger values penalize dataset-specific effects more strongly, causing the datasets to be better aligned, but possibly at the cost of higher reconstruction error. The default value is 5. We recommend using this value for most analyses, but find that it can be lowered to 1 in cases where the dataset differences are expected to be relatively small, such as scRNA-seq data from the same tissue but different individuals.
- thresh. This sets the convergence threshold. Lower values cause the algorithm to run longer. The default is 1e-6.
- max.iters. This variable sets the maximum number of iterations to perform. The default value is 30.
Stage III: Quantile Normalization and Joint Clustering (1 minute)
9. Using the metagene factors calculated by iNMF, we then assign each cell to the factor on which it has the highest loading, giving joint clusters that correspond across datasets. We then perform quantile normalization by dataset, factor, and cluster to fully integrate the datasets. To perform this analysis, typing in:
int.bmmc <- quantile_norm(int.bmmc)
Important parameters of quantile_norm are as follows:
- knn_k. This sets the number of nearest neighbors for within-dataset KNN graph. The default is 20.
- quantiles. This sets the number of quantiles to use for quantile normalization. The default is 50.
- min_cells. This indicates the minimum number of cells to consider a cluster as shared across datasets. The default is 20.
- dims.use.. This sets the indices of factors to use for quantile normalization. The user can pass in a vector of indices indicating specific factors. This is helpful for excluding factors capturing biological signals such as the cell cycle or technical signals such as mitochondrial genes. The default is all k of the factors.
- do.center. This indicates whether to center the data when scaling factors. The default is FALSE. This option should be set to TRUE when metagene loadings have a mean above zero, as with dense data such as DNA methylation.
- max_sample. This sets the maximum number of cells used for quantile normalization of each cluster and factor. The default is 1000.
- refine.knn. This indicates whether to increase robustness of cluster assignments using KNN graph. The default is TRUE.
- eps. This sets the error bound of the nearest neighbor search. The default is 0.9. Lower values give more accurate nearest neighbor graphs but take much longer to computer.
- ref_dataset. This indicates the name of the dataset to be used as a reference for quantile normalization. By default, the dataset with the largest number of cells is used.
10. The quantile_norm function gives joint clusters that correspond across datasets, which are often completely satisfactory and sufficient for downstream analyses. However, if desired, after quantile normalization, users can additionally run the Louvain algorithm for community detection, which is widely used in single-cell analysis and excels at merging small clusters into broad cell classes. This can be achieved by running the louvainCluster function. Several tuning parameters, including resolution, k, and prune control the number of clusters produced by this function. For this dataset, we use a resolution of 0.2, which yields 16 clusters (see below).
int.bmmc <- louvainCluster(int.bmmc, resolution = 0.2)
Stage IV: Visualization (2 - 3 minutes) and Downstream Analysis (30 - 40 minutes)
11. In order to visualize the clustering results, the user can use two dimensionality reduction methods supported by LIGER: t-SNE and UMAP. We find that often for datasets containing continuous variation such as cell differentiation, UMAP better preserves global relationships, whereas t-SNE works well for displaying discrete cell types, such as those in the brain. The UMAP algorithm (called by the runUMAP function) scales readily to large datasets. The runTSNE function also includes an option to use FFtSNE, a highly scalable implementation of t-SNE that can efficiently process huge datasets. For the BMMC dataset, we expect to see continuous lineage transitions among the differentiating cells, so we use UMAP to visualize the data in two dimensions:
int.bmmc <- runUMAP(int.bmmc, distance = 'cosine', n_neighbors = 30, min_dist = 0.3)
12. We can then visualize each cell, colored by cluster or dataset.
plotByDatasetAndCluster(int.bmmc, axis.labels = c('UMAP 1', 'UMAP 2'))
13. LIGER employs the Wilcoxon rank-sum test to identify marker genes that are differentially expressed in each cell type using the following settings. We provide parameters that allow the user to select which datasets to use (data.use) and whether to compare across clusters or across datasets within each cluster (compare.method). To identify marker genes for each cluster combining scATAC and scRNA profiles, typing in:
int.bmmc.wilcoxon <- runWilcoxon(int.bmmc, data.use = 'all', compare.method = 'clusters')
Important parameters of runWilcoxon are as follows:
- data.use. This selects which dataset (or set of datasets) to be included. The default is ‘a(chǎn)ll’ (using all the datasets).
- compare.method. This indicates whether to compare across clusters or across datasets with each cluster. Setting compare.method to ‘clusters’ compares each feature’s (genes, peaks, etc.) loading between clusters combining all datasets, which gives us the most specific features for each cluster. On the other hand, setting compare.method to ‘datasets’ gives us the most differentially loaded features for every cluster between datasets.
14. The number of marker genes identified by runWilcoxon varies and depends on the datasets used. The function outputs a data frame that the user can then filter to select markers which are statistically and biologically significant. For example, one strategy is to filter the output by taking markers which have padj (Benjamini-Hochberg adjusted p-value) less than 0.05 and logFC (log fold change between observations in group versus out) larger than 3:
int.bmmc.wilcoxon <- int.bmmc.wilcoxon[int.bmmc.wilcoxon$padj < 0.05,]
int.bmmc.wilcoxon <- int.bmmc.wilcoxon[int.bmmc.wilcoxon$logFC > 3,]
You can then re-sort the markers by its padj value in ascending order and choose the top 100 for each cell type. For example, we can subset and re-sort the output for Cluster 1 and take the top 20 markers by typing these commands:
wilcoxon.cluster_1 <- int.bmmc.wilcoxon[int.bmmc.wilcoxon$group == 1, ]
wilcoxon.cluster_1 <- wilcoxon.cluster_1[order(wilcoxon.cluster_1$padj), ]
markers.cluster_1 <- wilcoxon.cluster_1[1:20, ]
15. We also provide functions to check these markers by visualizing their expression and gene loadings across datasets. You can use the plotGene to visualize the expression or accessibility of a marker gene, which is helpful for visually confirming putative marker genes or investigating the distribution of known markers across the cell sequenced. Such plots can also confirm that divergent datasets are properly aligned.
For instance, we can plot S100A9, which the Wilcoxon test identified as a marker for Cluster 1, and MS4A1, a marker for Cluster 4:
S100A9 <- plotGene(int.bmmc, "S100A9", axis.labels = c('UMAP 1', 'UMAP 2'), return.plots = TRUE)
MS4A1 <- plotGene(int.bmmc, "MS4A1", axis.labels = c('UMAP 1', 'UMAP 2'), return.plots = TRUE)
plot_grid(S100A9[[2]],MS4A1[[2]],S100A9[[1]],MS4A1[[1]], ncol=2)
These plots indicate that S100A9 and MS4A1 are indeed specific markers for Cluster 1 and Cluster 4, respectively, with high values in these cell groups and low values elsewhere. Furthermore, we can see that the distributions are strikingly similar between the RNA and ATAC datasets, indicating that LIGER has properly aligned the two data types.
16. A key advantage of using iNMF instead of other dimensionality reduction approaches such as PCA is that the dimensions are individually interpretable. For example, a particular cell type is often captured by a single dimension of the space. Furthermore, iNMF identifies both shared and dataset-specific features along each dimension, giving insight into exactly how corresponding cells across datasets are both similar and different. The function plotGeneLoadings allows visual exploration of such information. It is recommended to call this function into a PDF file due to the large number of plots produced.
pdf('Gene_Loadings.pdf')
plotGeneLoadings(int.bmmc, return.plots = FALSE)
dev.off()
Alternatively, the function can return a list of plots. For example, we can visualize the factor loading of Factor 7 typing in:
gene_loadings <- plotGeneLoadings(int.bmmc, do.spec.plot = FALSE, return.plots = TRUE)
gene_loadings[[7]]
CCR6 <- plotGene(int.bmmc, "CCR6", axis.labels = c('UMAP 1', 'UMAP 2'), return.plots = TRUE)
NCF1 <- plotGene(int.bmmc, "NCF1", axis.labels = c('UMAP 1', 'UMAP 2'), return.plots = TRUE)
plot_grid(CCR6[[2]],NCF1[[2]],CCR6[[1]],NCF1[[1]], ncol=2)
These plots confirm that the expression and accessibility of these genes show clear differences. CCR6 shows nearly ubiquitous chromatin accessibility but is expressed only in clusters 2 and 4. The accessibility is highest in these clusters, but the ubiquitous accessibility suggests that the expression of CCR6 is somewhat decoupled from its accessibility, likely regulated by other factors. Conversely, NCF1 shows high expression in clusters 1, 3, 4, 9 and 11, despite no clear enrichment in chromatin accessibility within these clusters. This may again indicate decoupling between the expression and chromatin accessibility of NCF1. Another possibility is that the difference is due to technical effects–the gene body of NCF1 is short (~15KB), and short genes are more difficult to capture in scATAC-seq than in scRNA-seq because there are few sites for the ATAC-seq transposon to insert.
17. Single-cell measurements of chromatin accessibility and gene expression provide an unprecedented opportunity to investigate epigenetic regulation of gene expression. Ideally, such investigation would leverage paired ATAC-seq and RNA-seq from the same cells, but such simultaneous measurements are not generally available. However, using LIGER, it is possible to computationally infer “pseudo-multi-omic” profiles by linking scRNA-seq profiles–using the jointly inferred iNMF factors–to the most similar scATAC-seq profiles. After this imputation step, we can perform downstream analyses as if we had true single-cell multi-omic profiles. For example, we can identify putative enhancers by correlating the expression of a gene with the accessibility of neighboring intergenic peaks across the whole set of single cells.
Again, for convenience, we have prepared the pre-processed peak-level count data which is ready to use. The data can be downloaded here.
pmat <- readRDS('~/Downloads/GSM4138888_scATAC_BMMC_D5T1_peak_counts.RDS')
You can also follow the following tutorial to start from the beginning.
To achieve this, we first need a matrix of accessibility counts within intergenic peaks. The CellRanger pipeline for scATAC-seq outputs such a matrix by default, so we will use this as our starting point. The count matrix, peak genomic coordinates, and source cell barcodes output by CellRanger are stored in a folder named filtered_peak_matrix in the output directory. The user can load these and convert them into a peak-level count matrix by typing these commands:
barcodes <- read.table('/outs/filtered_peak_bc_matrix/barcodes.tsv', sep = '\t', header = FALSE, as.is = TRUE)$V1
peak.names <- read.table('/outs/filtered_peak_bc_matrix/peaks.bed', sep = '\t', header = FALSE)
peak.names <- paste0(peak.names$V1, ':', peak.names$V2, '-', peak.names$V3)
pmat <- readMM('/outs/filtered_peak_bc_matrix/matrix.mtx')
dimnames(pmat) <- list(peak.names, barcodes)
18. The peak-level count matrix is usually large, containing hundreds of thousands of peaks. We next filter this set of peaks to identify those showing cell-type-specific accessibility. To do this, we perform the Wilcoxon rank-sum test and pick those peaks which are differentially accessible within a specific cluster. Before running the test, however, we need to: (1) subset the peak-level count matrix to include the same cells as the gene-level counts matrix; (2) replace the original gene-level counts matrix in the LIGER object by peak-level counts matrix; and (3) normalize peak counts to sum to 1 within each cell.
int.bmmc.ds <- int.bmmc
pmat <- pmat[ ,intersect(colnames(pmat),colnames(int.bmmc@raw.data[['atac']]))]
int.bmmc.ds@raw.data[['atac']] <- pmat
int.bmmc.ds <- normalize(int.bmmc.ds)
Now we can perform the Wilcoxon test:
peak.wilcoxon <- runWilcoxon(int.bmmc.ds, data.use = 1, compare.method = 'clusters')
19. We can now use the results of the Wilcoxon test to retain only peaks showing differential accessibility across our set of joint clusters. Here we kept peaks with Benjamini-Hochberg adjusted p-value < 0.05 and log fold change > 2.
peak.wilcoxon <- peak.wilcoxon[peak.wilcoxon$padj < 0.05,]
peak.wilcoxon <- peak.wilcoxon[peak.wilcoxon$logFC > 2,]
peaks.sel <- unique(peak.wilcoxon$feature)
int.bmmc.ds@raw.data[['atac']] <- int.bmmc.ds@raw.data[['atac']][peaks.sel, ]
20. Using this set of differentially accessible peaks, we now impute a set of “pseudo-multi-omic” profiles by inferring the intergenic peak accessibility for scRNA-seq profiles based on their nearest neighbors in the joint LIGER space. LIGER provides a function named imputeKNN that performs this task, yielding a set of profiles containing both gene expression and chromatin accessibility measurements for the same single cells:
int.bmmc.ds <- imputeKNN(int.bmmc.ds, reference = 'atac')
Important parameters of imputeKNN are as follows:
- reference. Dataset containing values to impute into query dataset(s). For example, setting reference = ‘a(chǎn)tac’ uses the values in dataset ‘a(chǎn)tac’ to predict chromatin accessibility values for scRNA-seq profiles.
- queries. Dataset to be augmented by imputation. For example, setting query = ‘rna’ predicts chromatin accessibility values for scRNA-seq profiles.
- knn_k. The maximum number of nearest neighbors to use for imputation. The imputation algorithm simply builds a k-nearest neighbor graph using the aligned LIGER latent space, then averages values from the reference dataset across neighboring cells. The default value is 20.
- norm. This indicates whether to normalize data after imputation. The default is TRUE.
- scale. This indicates whether to scale data after imputation. The default is FALSE.
21. Now that we have both the (imputed) peak-level counts matrix and the (observed) gene expression counts matrix for the same cells, we can evaluate the relationships between pairs of genes and peaks, linking genes to putative regulatory elements. We use a simple strategy to identify such gene-peak links: Calculate correlation between gene expression and peak accessibility of all peaks within 500 KB of a gene, then retain all peaks showing statistically significant correlation with the gene. The linkGenesAndPeaks function performs this analysis:
gmat = int.bmmc@norm.data[['rna']]
pmat = int.bmmc.ds@norm.data[['rna']]
regnet = linkGenesAndPeaks(gene_counts = gmat, peak_counts = pmat, dist = 'spearman', alpha = 0.05, path_to_coords = 'some_path/gene_coords.bed')
rm(int.bmmc.ds, gmat, pmat)
Important parameters of linkGenesAndPeaks are as follows:
- gene_counts. A gene expression matrix (genes by cells) of normalized counts. This matrix has to share the same column names (cell barcodes) as the matrix passed to peak_counts.
- peak_counts. A peak-level matrix (peaks by cells) of normalized accessibility values, such as the one resulting from imputeKNN. This matrix must share the same column names (cell barcodes) as the matrix passed to gene_counts.
- genes.list. A list of the genes symbols to be tested. If not specified, this function will use all the gene symbols from the matrix passed to gmat by default.
- dist. This indicates the type of correlation to calculate – one of “spearman” (default), “pearson”, or “kendall”.
- alpha. Significance threshold for correlation p-value. Peak-gene correlations with p-values below this threshold are considered significant. The default is 0.05.
- path_to_coords.. The path to the gene coordinates file (in .bed format). We recommend passing in the same bed file used for making barcodes list in Step 1.
22. The output of this function is a sparse matrix with peak names as rows and gene symbols as columns, with each element indicating the correlation between peak i and gene j (or 0 if the gene and peak are not significantly linked). For example, we can subset the results for marker gene S100A9, which is a marker gene of Cluster 1 identified in the previous section, and rank these peaks by their correlation:
S100A9 <- regnet[, 'S100A9']
S100A9 <- S100A9[abs(S100A9) > 0]
View(S100A9[order(abs(S100A9), decreasing = TRUE)])
We also provide a function to transform the peaks-gene correlation matrix into an Interact Track supported by UCSC Genome Browser for visualizing the calculated linkage between genes and correlated peaks. To do this, tying in:
makeInteractTrack(regnet, genes.list = 'S100A9', path_to_coords = 'some_path/gene_coords.bed')
Important parameters of makeInteractTrack are as follows:
- corr.mat. A peaks x genes sparse matrix containing inferred gene-peak links (as output by linkGenesAndPeaks).
- genes.list. A vector of the gene symbols to be included in the output Interact Track file. If not specified, this function will use all the gene symbols from the matrix passed to corr.mat by default.
- path_to_coords. The path to the gene coordinates file (in .bed format). We recommend passing in the same bed file used for making barcodes list in Step 1.
- output_path. The path to the directory in which the Interact Track file will be stored. The default is the working directory.
The output of this function will be a UCSC Interact Track file named ‘Interact_Track.bed’ containing linkage information of the specified genes and correlated peaks stored in given directory. The user can then upload this file as a custom track using this page https://genome.ucsc.edu/cgi-bin/hgCustom and display it in the UCSC Genome browser.
As an example, the three peaks most correlated to S100A9 expression are shown below in the UCSC genome browser. One of the peaks overlaps with the TSS of S100A8, a neighboring gene that is co-expressed with S100A9, while another peak overlaps with the TSS of S100A9 itself. The last peak, chr1:153358896-153359396, does not overlap with a gene body and shows strong H3K27 acetylation across ENCODE cell lines, indicating that this is likely an intergenic regulatory element.
If we plot the accessibility of this peak and the expression of S100A9, we can see that the two are indeed very correlated and show strong enrichment in clusters 1 and 3. Thus, the intergenic peak likely serves as a cell-type-specific regulator of S100A9.
S100A9 <- plotGene(int.bmmc, "S100A9", axis.labels = c('UMAP 1', 'UMAP 2'), return.plots = TRUE)
peak1 <- plotGene(int.bmmc.ds, "chr1:153358896-153359396", axis.labels = c('UMAP 1', 'UMAP 2'), return.plots = TRUE)
peak1[[1]] / S100A9[[2]]
生活很好,有你更好