使用Cicero包進行單細胞ATAC-seq分析（二）：Constructing cis-regulatory networks

image

往期精選

使用Signac包進行單細胞ATAC-seq數(shù)據(jù)分析（一）：Analyzing PBMC scATAC-seq
使用Signac包進行單細胞ATAC-seq數(shù)據(jù)分析（二）：Motif analysis with Signac
使用Signac包進行單細胞ATAC-seq數(shù)據(jù)分析（三）：scATAC-seq data integration
使用Signac包進行單細胞ATAC-seq數(shù)據(jù)分析（四）：Merging objects
使用Cicero包進行單細胞ATAC-seq分析（一）：Cicero introduction and installing

在本教程中槐雾，我們將學(xué)習(xí)使用Cicero基于單細胞ATAC-seq數(shù)據(jù)構(gòu)建順式調(diào)控元件互作網(wǎng)絡(luò)夭委。

Running Cicero 運行Cicero

Create a Cicero CDS

由于單細胞染色質(zhì)可及性的數(shù)據(jù)通常非常稀疏，因此我們需要聚合相似的細胞以創(chuàng)建更密集的計數(shù)數(shù)據(jù)募强，來準確的計算評估共可及性（co-accessibility）分數(shù)株灸。Cicero使用k最近鄰方法來創(chuàng)建一個重疊的細胞集，并基于細胞相似度的降維坐標圖（如來自tSNE或DDRTree圖）來構(gòu)建這些集合擎值。

這里慌烧，我們將以tSNE和DDRTree降維方法進行展示，這兩種降維方法均可從Monocle（由Cicero加載）獲得幅恋。數(shù)據(jù)降維后杏死，我們可以使用make_cicero_cds函數(shù)創(chuàng)建聚合的CDS對象泵肄。make_cicero_cds函數(shù)以CDS對象和降維后的坐標為輸入捆交，降維后的坐標reduce_coordinates應(yīng)該采用data.frame或矩陣的形式，其中行名稱與CDS的pData表中的細胞ID相匹配腐巢，列為降維對象的坐標品追，例如：

image

set.seed(2017)
input_cds <- detectGenes(input_cds)
input_cds <- estimateSizeFactors(input_cds)
# 使用tSNE方法進行數(shù)據(jù)降維
input_cds <- reduceDimension(input_cds, max_components = 2, num_dim=6,
                      reduction_method = 'tSNE', norm_method = "none")

# 提取tSNE降維后的坐標
tsne_coords <- t(reducedDimA(input_cds))
row.names(tsne_coords) <- row.names(pData(input_cds))
# 使用make_cicero_cds函數(shù)構(gòu)建cicero CDS對象
cicero_cds <- make_cicero_cds(input_cds, reduced_coordinates = tsne_coords)

Run Cicero

Cicero包的主要功能是通過計算評估全基因組中位點之間的共可及性，從而預(yù)測順式調(diào)控相互作用冯丙。
Cicero提供了兩種方法來獲取此信息：

run_cicero：get Cicero outputs with all defaults（使用所有默認值獲得Cicero的輸出）肉瓦。run_cicero函數(shù)將使用默認值調(diào)用Cicero代碼的每個相關(guān)部分，并在計算過程中找出最佳估計參數(shù)胃惜。對于大多數(shù)用戶來說泞莉，這將是最好的起點。
Call functions separately, for more flexibility（分別調(diào)用函數(shù)船殉，以提高靈活性）鲫趁。對于希望在調(diào)用的參數(shù)上具有更大的靈活性，以及想要訪問中間信息的用戶利虫，Cicero允許我們分別調(diào)用每個組件挨厚。

image

（一）直接調(diào)用run_cicero函數(shù)（Recommended）

使用Cicero計算獲得共可及性得分的最簡單方法是直接調(diào)用run_cicero函數(shù)堡僻。該函數(shù)需要一個cicero CDS對象和一個基因組坐標文件（里面包含所用參考基因組的每個染色體的長度）作為輸入。Cicero包中提供了人類hg19的坐標疫剃，可以通過data("human.hg19.genome")獲取訪問钉疫。

# 加載人hg19參考基因組的坐標信息
data("human.hg19.genome")
# 提取18號染色體的信息
sample_genome <- subset(human.hg19.genome, V1 == "chr18")

# 直接使用run_cicero函數(shù)計算共可及性得分
conns <- run_cicero(cicero_cds, sample_genome) # Takes a few minutes to run
# 查看運行的結(jié)果
head(conns)
Peak1   Peak2   coaccess
chr18_10025_10225   chr18_104385_104585 0.03791526
chr18_10025_10225   chr18_10603_11103   0.86971957
chr18_10025_10225   chr18_111867_112367 0.00000000

（二）分別調(diào)用Cicero的不同函數(shù)（Alternative）

我們還可以使用一種替代的方法分別調(diào)用Cicero流程各個部分的函數(shù)，構(gòu)成Cicero流程的主要有三個功能：

estimate_distance_parameter. 該函數(shù)基于基因組的隨機小窗口計算距離懲罰參數(shù)巢价。
generate_cicero_models. 此函數(shù)使用上一步確定的距離參數(shù)牲阁，并使用圖形化LASSO模型基于距離的懲罰來計算基因組重疊窗口的共可及性得分。
assemble_connections. 此函數(shù)將generate_cicero_models的輸出作為輸入壤躲，并協(xié)調(diào)重疊的模型以創(chuàng)建最終共可及性得分的列表咨油。

以下是每個函數(shù)相關(guān)參數(shù)的解釋說明：

image

Visualizing Cicero Connections 可視化Cicero得到的連接

Cicero包提供了一個通用的繪圖功能，可以使用plot_connections函數(shù)可視化共可及性柒爵。此函數(shù)使用Gviz包的框架繪制基因組瀏覽器樣式的圖役电。我們改編了Sushi R包中的函數(shù)來映射連接。plot_connections函數(shù)有很多參數(shù)棉胀，其中一些在“高級可視化”部分中進行了詳細介紹法瑟。

# 加載基因注釋文件
data(gene_annotation_sample)

# 使用plot_connections進行可視化
plot_connections(conns, "chr18", 8575097, 8839855, 
                gene_model = gene_annotation_sample, 
                coaccess_cutoff = .25, 
                connection_width = .5, 
                collapseTranscripts = "longest" )

image

Comparing Cicero connections to other datasets 比較Cicero連接和其他數(shù)據(jù)集

通常，將Cicero計算得到的連接（connections）與其他具有類似連接類型的數(shù)據(jù)集進行比較非常有用唁奢。例如霎挟，我們可能想將Cicero的輸出與ChIA-PET的連接進行比較。為此麻掸，Cicero提供了一個名為compare_connections的函數(shù)酥夭，此函數(shù)以兩個數(shù)據(jù)框的連接對（connection pairs）conns1和conns2作為輸入，并從conns2中找到的conns1返回連接的邏輯向量脊奋。它使用GenomicRanges包進行此功能的比較熬北，并使用該程序包中的max_gap參數(shù)允許比較時出現(xiàn)斜率。
例如诚隙，如果我們想將Cicero預(yù)測的連接結(jié)果與一組（虛構(gòu)的）ChIA-PET連接數(shù)據(jù)進行比較讶隐，則可以運行：

# 構(gòu)建ChIA-PET的連接數(shù)據(jù)
chia_conns <-  data.frame(Peak1 = c("chr18_10000_10200", "chr18_10000_10200", 
                                  "chr18_49500_49600"), 
                          Peak2 = c("chr18_10600_10700", "chr18_111700_111800", 
                                  "chr18_10600_10700"))
# 查看連接數(shù)據(jù)
head(chia_conns)
#               Peak1               Peak2
# 1 chr18_10000_10200   chr18_10600_10700
# 2 chr18_10000_10200 chr18_111700_111800
# 3 chr18_49500_49600   chr18_10600_10700

# 使用compare_connections函數(shù)進行比較
conns$in_chia <- compare_connections(conns, chia_conns)
# 查看比較后的結(jié)果
head(conns)
#               Peak1               Peak2    coaccess in_chia
# 2 chr18_10025_10225   chr18_10603_11103  0.85209126    TRUE
# 3 chr18_10025_10225   chr18_11604_13986 -0.55433268   FALSE
# 4 chr18_10025_10225   chr18_49557_50057 -0.43594546   FALSE
# 5 chr18_10025_10225   chr18_50240_50740 -0.43662436   FALSE
# 6 chr18_10025_10225 chr18_104385_104585  0.00000000   FALSE
# 7 chr18_10025_10225 chr18_111867_112367  0.01405174   FALSE

當比較兩個完全不同的數(shù)據(jù)集時，我們可能會發(fā)現(xiàn)這種重疊過于嚴格久又。仔細觀察巫延，ChIA-PET數(shù)據(jù)的第二行與conns的最后一行非常接近，差異僅約67個堿基對地消，這可能是peak-calling的問題炉峰，我們可以調(diào)整max_gap參數(shù)：

conns$in_chia_100 <- compare_connections(conns, chia_conns, maxgap=100)

head(conns)
#               Peak1               Peak2    coaccess in_chia in_chia_100
# 2 chr18_10025_10225   chr18_10603_11103  0.85209126    TRUE        TRUE
# 3 chr18_10025_10225   chr18_11604_13986 -0.55433268   FALSE       FALSE
# 4 chr18_10025_10225   chr18_49557_50057 -0.43594546   FALSE       FALSE
# 5 chr18_10025_10225   chr18_50240_50740 -0.43662436   FALSE       FALSE
# 6 chr18_10025_10225 chr18_104385_104585  0.00000000   FALSE       FALSE
# 7 chr18_10025_10225 chr18_111867_112367  0.01405174   FALSE        TRUE

此外，Cicero的繪圖功能還可以直觀地比較不同的數(shù)據(jù)集脉执。為此疼阔，請使用compare_track參數(shù)。用來比較的連接數(shù)據(jù)必須包括前兩個峰值列之外的第三列适瓦，稱為“coaccess”竿开，該值決定了繪圖中連接的高度谱仪。這可能是一種定量方法，例如ChIA-PET中的連接數(shù)否彩。

# Add a column of 1s called "coaccess"
# 構(gòu)建ChIA-PET連接數(shù)據(jù)
chia_conns <-  data.frame(Peak1 = c("chr18_10000_10200", "chr18_10000_10200", 
                                  "chr18_49500_49600"), 
                          Peak2 = c("chr18_10600_10700", "chr18_111700_111800", 
                                  "chr18_10600_10700"),
                          coaccess = c(1, 1, 1))

# 使用plot_connections函數(shù)進行可視化
plot_connections(conns, "chr18", 10000, 112367, 
                gene_model = gene_annotation_sample, 
                coaccess_cutoff = 0,
                connection_width = .5,
                comparison_track = chia_conns,
                include_axis_track = F,
                collapseTranscripts = "longest")

image

Finding cis-Co-accessibility Networks (CCANS) 構(gòu)建順式共可及性網(wǎng)絡(luò)

除了計算成對的共可及性得分疯攒，Cicero還可以用來構(gòu)建順式共可及性網(wǎng)絡(luò)Cis-Co-accessibility Networks（CCAN），CCAN是彼此高度共可及性互作模塊列荔。我們使用Louvain檢測算法（Blondel等人敬尺，2008）來查找傾向于共可及性的互作集群。函數(shù)generate_ccans將連接數(shù)據(jù)作為輸入贴浙，并為每個輸入峰值輸出CCAN評分砂吞。

image

generate_ccans函數(shù)有一個coaccess_cutoff_override參數(shù)，當該參數(shù)設(shè)置為NULL時崎溃，該函數(shù)將根據(jù)不同cutoff的總CCAN數(shù)量蜻直，確定并報告CCAN生成的合適的共可及性得分的閾值。我們還可以將coaccess_cutoff_override值設(shè)置為介于0和1之間的數(shù)字袁串，以覆蓋函數(shù)的cutoff-find部分幕侠。如果您覺得自動找到的臨界值太嚴格或太松散谦去，或者如果您正在重新運行代碼并知道臨界值龄毡，則此選項很有用薄声，因為臨界值的查找過程可能會很慢。

# 使用generate_ccans函數(shù)構(gòu)建CCAN
CCAN_assigns <- generate_ccans(conns)
# [1] "Co-accessibility cutoff used: 0.24"
# 查看計算的結(jié)果
head(CCAN_assigns)
#                                    Peak CCAN
# chr18_10025_10225     chr18_10025_10225    1
# chr18_10603_11103     chr18_10603_11103    1
# chr18_11604_13986     chr18_11604_13986    1
# chr18_49557_50057     chr18_49557_50057    1
# chr18_50240_50740     chr18_50240_50740    1
# chr18_157883_158536 chr18_157883_158536    1

Cicero gene activity scores 計算Cicero基因活性得分

我們發(fā)現(xiàn)破镰，直接通過啟動子的可及性預(yù)測基因的表達餐曼，其結(jié)果往往不是很好。但是鲜漩，使用Cicero得到的連接源譬，我們可以更好地了解啟動子及其相關(guān)遠端位置的總體可及性。這些區(qū)域可及性的綜合得分與基因表達具有更好的一致性宇整。我們稱此分數(shù)為Cicero基因活性分數(shù)瓶佳，它是使用兩個函數(shù)計算得出的芋膘。

初始函數(shù)稱為build_gene_activity_matrix鳞青。此函數(shù)以CDS對象和Cicero連接列表為輸入，計算得出未標準化的基因活性得分表格为朋。注意臂拓，在輸入的CDS對象中必須在fData表中有一列稱為“基因”，如果該peak是啟動子习寸，則指示該基因胶惰；如果該peak是在遠端，則指示為NA霞溪。

使用build_gene_activity_matrix函數(shù)計算得到的基因活性得分是未標準化的孵滞，我們還需使用第二個函數(shù)normalize_gene_activities對其進行標準化處理中捆。如果要比較不同數(shù)據(jù)子集中的基因活性，則應(yīng)通過傳入未歸一化的矩陣列表來對所有基因活性子集進行歸一化坊饶。如果只希望對一個矩陣進行歸一化泄伪，只需將其自己傳遞給函數(shù)。標準化后的基因活性得分范圍是0到1匿级。

#### Add a column for the fData table indicating the gene if a peak is a promoter ####
# Create a gene annotation set that only marks the transcription start sites of the genes. We use this as a proxy for promoters.
# To do this we need the first exon of each transcript

pos <- subset(gene_annotation_sample, strand == "+")
pos <- pos[order(pos$start),] 
pos <- pos[!duplicated(pos$transcript),] # remove all but the first exons per transcript
pos$end <- pos$start + 1 # make a 1 base pair marker of the TSS

neg <- subset(gene_annotation_sample, strand == "-")
neg <- neg[order(neg$start, decreasing = TRUE),] 
neg <- neg[!duplicated(neg$transcript),] # remove all but the first exons per transcript
neg$start <- neg$end - 1

gene_annotation_sub <- rbind(pos, neg)

# Make a subset of the TSS annotation columns containing just the coordinates and the gene name
gene_annotation_sub <- gene_annotation_sub[,c(1:3, 8)]

# Rename the gene symbol column to "gene"
names(gene_annotation_sub)[4] <- "gene"

input_cds <- annotate_cds_by_site(input_cds, gene_annotation_sub)

head(fData(input_cds))
#                               site_name chr    bp1    bp2 num_cells_expressed overlap          gene
# chr18_10025_10225     chr18_10025_10225  18  10025  10225                   5      NA          <NA>
# chr18_10603_11103     chr18_10603_11103  18  10603  11103                   1       1    AP005530.1
# chr18_11604_13986     chr18_11604_13986  18  11604  13986                   9     203    AP005530.1
# chr18_49557_50057     chr18_49557_50057  18  49557  50057                   2     331 RP11-683L23.1
# chr18_50240_50740     chr18_50240_50740  18  50240  50740                   2     129 RP11-683L23.1
# chr18_104385_104585 chr18_104385_104585  18 104385 104585                   1      NA          <NA>


#### Generate gene activity scores ####
# generate unnormalized gene activity matrix
unnorm_ga <- build_gene_activity_matrix(input_cds, conns)

# remove any rows/columns with all zeroes
unnorm_ga <- unnorm_ga[!Matrix::rowSums(unnorm_ga) == 0, !Matrix::colSums(unnorm_ga) == 0]

# make a list of num_genes_expressed
num_genes <- pData(input_cds)$num_genes_expressed
names(num_genes) <- row.names(pData(input_cds))

# normalize
cicero_gene_activities <- normalize_gene_activities(unnorm_ga, num_genes)

# if you had two datasets to normalize, you would pass both:
# num_genes should then include all cells from both sets
unnorm_ga2 <- unnorm_ga
cicero_gene_activities <- normalize_gene_activities(list(unnorm_ga, unnorm_ga2), num_genes)

Advanced visualizaton 高級可視化技巧

Some useful parameters

Viewpoints：Viewpoints參數(shù)可以讓我們僅查看來自基因組中特定位置的連接情況蟋滴。這在與4C-seq數(shù)據(jù)進行比較時可能很有用。

# viewpoint = "chr18_48000_53000"參數(shù)指定可視化特定區(qū)域
plot_connections(conns, "chr18", 10000, 112367, 
                viewpoint = "chr18_48000_53000",
                gene_model = gene_annotation_sample, 
                coaccess_cutoff = 0,
                connection_width = .5,
                comparison_track = chia_conns,
                include_axis_track = F,
                collapseTranscripts = "longest")

image

alpha_by_coaccess: alpha_by_coaccess參數(shù)可以使連接曲線的Alpha（透明度）基于共可及性的大小進行縮放痘绎。比較以下兩個圖津函。

# alpha_by_coaccess = FALSE
plot_connections(conns, 
                alpha_by_coaccess = FALSE, 
                "chr18", 8575097, 8839855, 
                gene_model = gene_annotation_sample, 
                coaccess_cutoff = 0.1, 
                connection_width = .5, 
                collapseTranscripts = "longest" )

# alpha_by_coaccess = TRUE
plot_connections(conns, 
                alpha_by_coaccess = TRUE, 
                "chr18", 8575097, 8839855, 
                gene_model = gene_annotation_sample, 
                coaccess_cutoff = 0.1, 
                connection_width = .5, 
                collapseTranscripts = "longest" )

image

Colors: plot_connections函數(shù)提供了許多用于設(shè)定不同類型顏色的參數(shù)。
peak_color
comparison_peak_color
connection_color
comparison_connection_color
gene_model_color
viewpoint_color
viewpoint_fill

可以使用顏色的名稱（如：“blue”）或顏色代碼（如：“＃B4656F”）的形式為每個參數(shù)提供顏色值孤页。另外尔苦，對于前四個參數(shù)，我們可以使用conns文件中提供顏色值的列行施。

# When the color column is not already colors, random colors are assigned
plot_connections(conns, 
                "chr18", 10000, 112367,
                connection_color = "in_chia_100",
                comparison_track = chia_conns,
                peak_color = "green",
                comparison_peak_color = "orange",
                comparison_connection_color = "purple",
                gene_model_color = "#2DD881",
                gene_model = gene_annotation_sample, 
                coaccess_cutoff = 0.1, 
                connection_width = .5, 
                collapseTranscripts = "longest" )

image

# If I want specific color scheme, I can make a column of color names
conns$conn_color <- "orange"
conns$conn_color[conns$in_chia_100] <- "green"
plot_connections(conns, 
                "chr18", 10000, 112367,
                connection_color = "conn_color",
                comparison_track = chia_conns,
                peak_color = "green",
                comparison_peak_color = "orange",
                comparison_connection_color = "purple",
                gene_model_color = "#2DD881",
                gene_model = gene_annotation_sample, 
                coaccess_cutoff = 0.1, 
                connection_width = .5, 
                collapseTranscripts = "longest" )

image

# For coloring Peaks, I need the color column to correspond to Peak1:
conns$peak1_color <- FALSE
conns$peak1_color[conns$Peak1 == "chr18_11604_13986"] <- TRUE
plot_connections(conns, 
                "chr18", 10000, 112367,
                connection_color = "green",
                comparison_track = chia_conns,
                peak_color = "peak1_color",
                comparison_peak_color = "orange",
                comparison_connection_color = "purple",
                gene_model_color = "#2DD881",
                gene_model = gene_annotation_sample, 
                coaccess_cutoff = 0.1, 
                connection_width = .5, 
                collapseTranscripts = "longest" )

image

Customizing everything with return_as_list
如果設(shè)置return_as_list參數(shù)為TRUE蕉堰，將不會繪制出圖像，而是返回一個圖像列表悲龟。

# 設(shè)置return_as_list = TRUE
plot_list <- plot_connections(conns, 
                    "chr18", 10000, 112367,
                    gene_model = gene_annotation_sample, 
                    coaccess_cutoff = 0.1, 
                    connection_width = .5, 
                    collapseTranscripts = "longest", 
                    return_as_list = TRUE)
plot_list
# [[1]]
# CustomTrack ''

# [[2]]
# AnnotationTrack 'Peaks'
# | genome: NA
# | active chromosome: chr18
# | annotation features: 7

# [[3]]
# Genome axis 'Axis'

# [[4]]
# GeneRegionTrack 'Gene Model'
# | genome: NA
# | active chromosome: chr18
# | annotation features: 33

這將按繪圖順序返回一個由各個圖形元件組成的列表屋讶。首先是CustomTrack，它是Cicero連接圖形须教，其次是峰注釋軌道（peak annotation track）皿渗，再就是基因組軸軌道（genome axis track），最后是基因模型軌道（gene model track）∏嵯伲現(xiàn)在乐疆，我可以使用Gviz包添加其他的track，并重新排列tracks或替換tracks：

conservation <- UcscTrack(genome = "hg19", chromosome = "chr18",
                        track = "Conservation", table = "phyloP100wayAll",
                        fontsize.group=6,fontsize=6, cex.axis=.8,
                        from = 10000, to = 112367, trackType = "DataTrack",
                        start = "start", end = "end", data = "score", size = .1,
                        type = "histogram", window = "auto", col.histogram = "#587B7F",
                        fill.histogram = "#587B7F", ylim = c(-1, 2.5),
                        name = "Conservation")

# I will replace the genome axis track with a track on conservation values
plot_list[[3]] <- conservation   

# To make the plot, I will now use Gviz's plotTracks function
# The included options are the defaults in plot_connections, 
# but all can be modified according to Gviz's documentation

# The main new paramter that you must include, is the sizes
# parameter. This parameter controls what proportion of the
# height of your plot is allocated for each track. The sizes
# parameter must be a vector of the same length as plot_list

Gviz::plotTracks(plot_list,  
                sizes = c(2,1,1,1),
                from = 10000, to = 112367, chromosome = "chr18", 
                transcriptAnnotation = "symbol",
                col.axis = "black", 
                fontsize.group = 6,
                fontcolor.legend = "black",
                lwd=.3,
                title.width = .5,
                background.title = "transparent", 
                col.border.title = "transparent")

image

Important Considerations for Non-Human Data

Cicero的默認參數(shù)是設(shè)計用于人類和小鼠的贬养。對于其他不同的模式生物挤土，我們需要更改一些相應(yīng)參數(shù)，以下是對每個參數(shù)的簡短說明：

image

參考來源：https://cole-trapnell-lab.github.io/cicero-release/docs/