系列回顧:
ArchR官網(wǎng)教程學(xué)習(xí)筆記1:Getting Started with ArchR
ArchR官網(wǎng)教程學(xué)習(xí)筆記2:基于ArchR推測(cè)Doublet
ArchR官網(wǎng)教程學(xué)習(xí)筆記3:創(chuàng)建ArchRProject
ArchR官網(wǎng)教程學(xué)習(xí)筆記4:ArchR的降維
ArchR官網(wǎng)教程學(xué)習(xí)筆記5:ArchR的聚類
ArchR官網(wǎng)教程學(xué)習(xí)筆記6:單細(xì)胞嵌入(Single-cell Embeddings)
ArchR官網(wǎng)教程學(xué)習(xí)筆記7:ArchR的基因評(píng)分和Marker基因
ArchR官網(wǎng)教程學(xué)習(xí)筆記8:定義與scRNA-seq一致的聚類
ArchR官網(wǎng)教程學(xué)習(xí)筆記9:ArchR的偽批量重復(fù)
ArchR官網(wǎng)教程學(xué)習(xí)筆記10:ArchR的call peak
ArchR官網(wǎng)教程學(xué)習(xí)筆記11:鑒定Marker峰
ArchR官網(wǎng)教程學(xué)習(xí)筆記12:Motif和Feature富集
ArchR官網(wǎng)教程學(xué)習(xí)筆記13:ChromVAR偏差富集
ArchR官網(wǎng)教程學(xué)習(xí)筆記14:ArchR的Footprinting分析
在開始之前,先要對(duì)本章筆記說明一下觉啊。本章的中間“Peak2GeneLinkage”部分里的代碼禁炒,因?yàn)榘凑展倬W(wǎng)的代碼運(yùn)行總是報(bào)錯(cuò)沙郭,所以我去ArchR的github上直接問的developer(見https://github.com/GreenleafLab/ArchR/issues/442)啡直。根據(jù)developer說的败徊,是代碼本身的bug尿这,所以他們?cè)谛掳鍭rchR里修復(fù)了這個(gè)bug只磷。(如何安裝新版的ArchR下文里有提到)所以這就是對(duì)于任何我們想學(xué)習(xí)的R包或者軟件,一定要自己親自運(yùn)行一下名惩,否則即便你使用的官網(wǎng)示例數(shù)據(jù)和官方代碼澎胡,仍然有可能報(bào)錯(cuò)。
ArchR的主要優(yōu)勢(shì)之一是它能夠集成多層次的信息娩鹉,從而提供新的角度攻谁。這可以采取ATAC-seq分析的形式,如識(shí)別峰共可接近性以預(yù)測(cè)調(diào)控相互作用弯予,或整合scRNA-seq數(shù)據(jù)的分析戚宦,如通過peak-to-gene連鎖分析預(yù)測(cè)增強(qiáng)子活性。在任何一種情況下锈嫩,ArchR都可以輕松地從你的scATAC-seq數(shù)據(jù)中獲得更深入的見解受楼。
(一)創(chuàng)建低重疊的細(xì)胞聚集(Creating Low-Overlapping Aggregates of Cells)
ArchR方便了許多涉及特征相關(guān)性的綜合分析垦搬。用稀疏的單細(xì)胞數(shù)據(jù)進(jìn)行這些計(jì)算會(huì)導(dǎo)致相關(guān)性分析中大量的背景信號(hào)。為了規(guī)避這一挑戰(zhàn)艳汽,我們采用了Cicero的方法猴贰,在這些分析之前創(chuàng)建低重疊的單細(xì)胞聚集。為了減少偏差河狐,我們過濾與任何其他聚合重疊度超過80%的聚合( aggregates )米绕。為了提高這種方法的速度,我們開發(fā)了一個(gè)優(yōu)化的迭代重疊檢查過程的實(shí)現(xiàn)馋艺,并使用“Rcpp”包在c++中實(shí)現(xiàn)了快速的特征相關(guān)性栅干。這些優(yōu)化方法在ArchR中用于計(jì)算峰共可接近性、peak-to-gene的連鎖和其他連鎖分析捐祠。這些低重疊聚合的使用是在后臺(tái)進(jìn)行的碱鳞,但為了清楚起見,我們?cè)谶@里提到它雏赦。
(二)共可接近性分析
協(xié)同可接近性是指跨多個(gè)單細(xì)胞的兩個(gè)峰之間的可接近性相關(guān)性劫笙。換句話說,當(dāng)A峰在單細(xì)胞內(nèi)可接近時(shí)星岗,B峰通常也可接近填大。下面我們將直觀地說明這一概念,顯示enhancer E3通常與promoter P可共同接近俏橘。
關(guān)于共可接近性分析需要注意的一件事是允华,它通常將特定細(xì)胞類型的峰標(biāo)識(shí)為共可接近的。這是因?yàn)檫@些峰通常在單個(gè)細(xì)胞類型中都可以接近寥掐,而在所有其他細(xì)胞類型中通常都不能被接近靴寂。這產(chǎn)生了很強(qiáng)的相關(guān)性,但并不一定意味著這些峰之間存在調(diào)控關(guān)系召耘。
為了計(jì)算ArchR中的協(xié)同可接近性百炬,我們使用儲(chǔ)存在ArchRProject中協(xié)同可接近性峰信息的addCoAccessibility()
函數(shù)。
> projHeme5 <- addCoAccessibility(
ArchRProj = projHeme5,
reducedDims = "IterativeLSI"
)
我們可以通過getCoAccessibility()
函數(shù)從ArchRProject檢索這個(gè)協(xié)同可接近性信息污它,如果returnLoops = FALSE
剖踊,該函數(shù)將返回一個(gè)DataFrame對(duì)象:
> cA <- getCoAccessibility(
ArchRProj = projHeme5,
corCutOff = 0.5,
resolution = 1,
returnLoops = FALSE
)
DataFrame包含了一些重要的信息。queryHits
和subjectHits
列表示是相關(guān)的兩個(gè)峰的索引衫贬。correlation列給出了這兩個(gè)峰之間的可接近性的數(shù)值相關(guān)性:
> cA
DataFrame with 66069 rows and 11 columns
queryHits subjectHits seqnames correlation Variability1 Variability2
<integer> <integer> <Rle> <numeric> <numeric> <numeric>
1 7 11 chr1 0.674068131604548 0.00454332740253749 0.0306386713792158
2 11 7 chr1 0.674068131604548 0.0306386713792158 0.00454332740253749
3 20 21 chr1 0.555556855508682 0.0263183019286269 0.00550013127461839
4 21 20 chr1 0.555556855508682 0.00550013127461839 0.0263183019286269
5 44 55 chr1 0.556849339818455 0.0328129362145881 0.0276923110094824
... ... ... ... ... ... ...
66065 140763 140761 chrX 0.544862329728945 0.00105524816663243 0.00103018596537361
66066 140819 140825 chrX 0.52077899590599 0.0082391986288855 0.00123800124356212
66067 140825 140819 chrX 0.52077899590599 0.00123800124356212 0.0082391986288855
66068 140825 140826 chrX 0.536143354146812 0.00123800124356212 0.000926987273521345
66069 140826 140825 chrX 0.536143354146812 0.000926987273521345 0.00123800124356212
TStat Pval FDR VarQuantile1 VarQuantile2
<numeric> <numeric> <numeric> <numeric> <numeric>
1 14.8753870882021 1.57568490812292e-41 2.25369520385051e-39 0.497425136231847 0.922787369881207
2 14.8753870882021 1.57568490812297e-41 2.25369520385051e-39 0.922786833059823 0.497426746695999
3 12.2600711823064 2.64734422862833e-30 1.16921566549878e-28 0.897581458618855 0.553941154713533
4 12.2600711823064 2.64734422862833e-30 1.16921566549878e-28 0.553940081070765 0.897582532261623
5 12.2885938249175 2.02269876477295e-30 9.03837469990896e-29 0.933085751311052 0.906216767401199
... ... ... ... ... ...
66065 12.0240635693672 2.42373148023141e-29 9.76842969452669e-28 0.113733662512206 0.109280192310893
66066 11.4925907164108 3.25999116183565e-27 1.06325255294007e-25 0.664993931234254 0.14330607891167
66067 11.4925907164108 3.25999116183565e-27 1.06325255294007e-25 0.143306615733054 0.66499607851979
66068 11.8316525493 1.4493708075588e-28 5.40902049408846e-27 0.143306615733054 0.092152369234337
66069 11.8316525493 1.44937080755876e-28 5.40902049408846e-27 0.0921534428771049 0.14330607891167
這個(gè)共可接近性數(shù)據(jù)DataFrame還具有一個(gè)metadata組件德澈,該metadata組件包含相關(guān)峰的GRanges
對(duì)象。上面提到的queryHits
和subject
的索引適用于這個(gè)GRanges對(duì)象:
> metadata(cA)[[1]]
GRanges object with 140865 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
Mono chr1 752495-752995 *
CD4.N chr1 757871-758371 *
CD4.N chr1 762690-763190 *
GMP chr1 773670-774170 *
B chr1 801006-801506 *
... ... ... ...
NK chrX 154807254-154807754 *
Mono chrX 154840907-154841407 *
Mono chrX 154841994-154842494 *
NK chrX 154862043-154862543 *
GMP chrX 154996991-154997491 *
-------
seqinfo: 23 sequences from an unspecified genome; no seqlengths
上面的代碼里固惯,returnLoops = FALSE
梆造。如果我們?cè)O(shè)置returnLoops = TRUE
,那么getCoAccessibility()
將以循環(huán)track的形式返回co-accessibility信息葬毫。在這個(gè)GRanges
對(duì)象中镇辉,IRanges
的開始和結(jié)束映射到相互作用的兩個(gè)不同的可共同接近的峰屡穗。分辨率參數(shù)設(shè)置這些循環(huán)的堿基對(duì)分辨率。當(dāng)resolution = 1時(shí)忽肛,會(huì)創(chuàng)建連接每個(gè)峰中心的loop:
> cA <- getCoAccessibility(
ArchRProj = projHeme5,
corCutOff = 0.5,
resolution = 1,
returnLoops = TRUE
)
> cA[[1]]
GRanges object with 33035 ranges and 9 metadata columns:
seqnames ranges strand | correlation Variability1
<Rle> <IRanges> <Rle> | <numeric> <numeric>
[1] chr1 845576-856623 * | 0.674068131604548 0.00454332740253749
[2] chr1 894703-895398 * | 0.555556855508682 0.0263183019286269
[3] chr1 968510-1004214 * | 0.556849339818455 0.0328129362145881
[4] chr1 974306-975108 * | 0.572146747607589 0.0093636446649745
[5] chr1 1072037-1107432 * | 0.541661438734924 0.00679983990004832
... ... ... ... . ... ...
[33031] chrX 153276031-153306080 * | 0.782117525371192 0.0140283037671134
[33032] chrX 153306080-153342756 * | 0.538505319617638 0.00154044262912208
[33033] chrX 153591527-153594212 * | 0.544862329728945 0.00103018596537361
[33034] chrX 153941234-153959650 * | 0.52077899590599 0.0082391986288855
[33035] chrX 153959650-153960343 * | 0.536143354146812 0.000926987273521345
Variability2 TStat Pval FDR
<numeric> <numeric> <numeric> <numeric>
[1] 0.0306386713792158 14.8753870882021 1.57568490812292e-41 2.25369520385051e-39
[2] 0.00550013127461839 12.2600711823064 2.64734422862833e-30 1.16921566549878e-28
[3] 0.0276923110094824 12.2885938249175 2.02269876477295e-30 9.03837469990896e-29
[4] 0.00607430839411806 12.6261781901177 8.16876834093376e-32 4.18323084851364e-30
[5] 0.013335574910477 11.9534260620727 4.68165878439767e-29 1.83531999911093e-27
... ... ... ... ...
[33031] 0.012680152275175 17.2598293746197 1.96185228502518e-52 1.30520421001205e-49
[33032] 0.012680152275175 11.8837765839787 8.94202281836675e-29 3.40710822672161e-27
[33033] 0.00105524816663243 12.0240635693672 2.42373148023141e-29 9.76842969452669e-28
[33034] 0.00123800124356212 11.4925907164108 3.25999116183565e-27 1.06325255294007e-25
[33035] 0.00123800124356212 11.8316525493 1.44937080755876e-28 5.40902049408846e-27
VarQuantile1 VarQuantile2 value
<numeric> <numeric> <numeric>
[1] 0.497425136231847 0.922787369881207 0.674068131604548
[2] 0.897581458618855 0.553941154713533 0.555556855508682
[3] 0.933085751311052 0.906216767401199 0.556849339818455
[4] 0.696956276435098 0.581836004288129 0.572146747607589
[5] 0.613238444785505 0.773817288547399 0.541661438734924
... ... ... ...
[33031] 0.783751168257537 0.763801275165515 0.782117525371192
[33032] 0.190906567848586 0.763801275165515 0.538505319617638
[33033] 0.109280729132277 0.113733125690822 0.544862329728945
[33034] 0.664993931234254 0.14330607891167 0.52077899590599
[33035] 0.0921534428771049 0.14330607891167 0.536143354146812
-------
seqinfo: 23 sequences from an unspecified genome; no seqlengths
相反鸡捐,如果我們將loop的分辨率降低到resolution = 1000,這將有助于過度繪制協(xié)同可接近性相互作用麻裁。下面,我們看到GRanges對(duì)象中的條目總數(shù)比上面的少:
> cA <- getCoAccessibility(
ArchRProj = projHeme5,
corCutOff = 0.5,
resolution = 1000,
returnLoops = TRUE
)
> cA[[1]]
GRanges object with 31594 ranges and 9 metadata columns:
seqnames ranges strand | correlation Variability1
<Rle> <IRanges> <Rle> | <numeric> <numeric>
[1] chr1 845500-856500 * | 0.674068131604548 0.00454332740253749
[2] chr1 894500-895500 * | 0.555556855508682 0.0263183019286269
[3] chr1 968500-1004500 * | 0.556849339818455 0.0328129362145881
[4] chr1 974500-975500 * | 0.572146747607589 0.0093636446649745
[5] chr1 1072500-1079500 * | 0.653852889650734 0.0171476615215405
... ... ... ... . ... ...
[31590] chrX 153276500-153306500 * | 0.782117525371192 0.0140283037671134
[31591] chrX 153306500-153342500 * | 0.538505319617638 0.00154044262912208
[31592] chrX 153591500-153594500 * | 0.544862329728945 0.00103018596537361
[31593] chrX 153941500-153959500 * | 0.52077899590599 0.0082391986288855
[31594] chrX 153959500-153960500 * | 0.536143354146812 0.000926987273521345
Variability2 TStat Pval FDR
<numeric> <numeric> <numeric> <numeric>
[1] 0.0306386713792158 14.8753870882021 1.57568490812292e-41 2.25369520385051e-39
[2] 0.00550013127461839 12.2600711823064 2.64734422862833e-30 1.16921566549878e-28
[3] 0.0276923110094824 12.2885938249175 2.02269876477295e-30 9.03837469990896e-29
[4] 0.00607430839411806 12.6261781901177 8.16876834093376e-32 4.18323084851364e-30
[5] 0.0405524614744011 14.4292755824871 1.50202844679026e-39 1.72121316754705e-37
... ... ... ... ...
[31590] 0.012680152275175 17.2598293746197 1.96185228502518e-52 1.30520421001205e-49
[31591] 0.012680152275175 11.8837765839787 8.94202281836675e-29 3.40710822672161e-27
[31592] 0.00105524816663243 12.0240635693672 2.42373148023141e-29 9.76842969452669e-28
[31593] 0.00123800124356212 11.4925907164108 3.25999116183565e-27 1.06325255294007e-25
[31594] 0.00123800124356212 11.8316525493 1.44937080755876e-28 5.40902049408846e-27
VarQuantile1 VarQuantile2 value
<numeric> <numeric> <numeric>
[1] 0.497425136231847 0.922787369881207 0.674068131604548
[2] 0.897581458618855 0.553941154713533 0.555556855508682
[3] 0.933085751311052 0.906216767401199 0.556849339818455
[4] 0.696956276435098 0.581836004288129 0.572146747607589
[5] 0.82168833546183 0.961734834930109 0.653852889650734
... ... ... ...
[31590] 0.783751168257537 0.763801275165515 0.782117525371192
[31591] 0.190906567848586 0.763801275165515 0.538505319617638
[31592] 0.109280729132277 0.113733125690822 0.544862329728945
[31593] 0.664993931234254 0.14330607891167 0.52077899590599
[31594] 0.0921534428771049 0.14330607891167 0.536143354146812
-------
seqinfo: 23 sequences from an unspecified genome; no seqlengths
類似地源祈,如果我們進(jìn)一步降低分辨率煎源,使resolution = 10000,我們將識(shí)別出更少的協(xié)同可接近性相互作用:
> cA <- getCoAccessibility(
ArchRProj = projHeme5,
corCutOff = 0.5,
resolution = 10000,
returnLoops = TRUE
)
> cA[[1]]
GRanges object with 21257 ranges and 9 metadata columns:
seqnames ranges strand | correlation Variability1
<Rle> <IRanges> <Rle> | <numeric> <numeric>
[1] chr1 845000-855000 * | 0.674068131604548 0.00454332740253749
[2] chr1 895000 * | 0.555556855508682 0.0263183019286269
[3] chr1 965000-1005000 * | 0.556849339818455 0.0328129362145881
[4] chr1 975000 * | 0.572146747607589 0.0093636446649745
[5] chr1 1075000 * | 0.653852889650734 0.0171476615215405
... ... ... ... . ... ...
[21253] chrX 153275000-153305000 * | 0.782117525371192 0.0140283037671134
[21254] chrX 153305000-153345000 * | 0.538505319617638 0.00154044262912208
[21255] chrX 153595000 * | 0.544862329728945 0.00103018596537361
[21256] chrX 153945000-153955000 * | 0.52077899590599 0.0082391986288855
[21257] chrX 153955000-153965000 * | 0.536143354146812 0.000926987273521345
Variability2 TStat Pval FDR
<numeric> <numeric> <numeric> <numeric>
[1] 0.0306386713792158 14.8753870882021 1.57568490812292e-41 2.25369520385051e-39
[2] 0.00550013127461839 12.2600711823064 2.64734422862833e-30 1.16921566549878e-28
[3] 0.0276923110094824 12.2885938249175 2.02269876477295e-30 9.03837469990896e-29
[4] 0.00607430839411806 12.6261781901177 8.16876834093376e-32 4.18323084851364e-30
[5] 0.0405524614744011 14.4292755824871 1.50202844679026e-39 1.72121316754705e-37
... ... ... ... ...
[21253] 0.012680152275175 17.2598293746197 1.96185228502518e-52 1.30520421001205e-49
[21254] 0.012680152275175 11.8837765839787 8.94202281836675e-29 3.40710822672161e-27
[21255] 0.00105524816663243 12.0240635693672 2.42373148023141e-29 9.76842969452669e-28
[21256] 0.00123800124356212 11.4925907164108 3.25999116183565e-27 1.06325255294007e-25
[21257] 0.00123800124356212 11.8316525493 1.44937080755876e-28 5.40902049408846e-27
VarQuantile1 VarQuantile2 value
<numeric> <numeric> <numeric>
[1] 0.497425136231847 0.922787369881207 0.674068131604548
[2] 0.897581458618855 0.553941154713533 0.555556855508682
[3] 0.933085751311052 0.906216767401199 0.556849339818455
[4] 0.696956276435098 0.581836004288129 0.572146747607589
[5] 0.82168833546183 0.961734834930109 0.653852889650734
... ... ... ...
[21253] 0.783751168257537 0.763801275165515 0.782117525371192
[21254] 0.190906567848586 0.763801275165515 0.538505319617638
[21255] 0.109280729132277 0.113733125690822 0.544862329728945
[21256] 0.664993931234254 0.14330607891167 0.52077899590599
[21257] 0.0921534428771049 0.14330607891167 0.536143354146812
-------
seqinfo: 23 sequences from an unspecified genome; no seqlengths
一旦我們?cè)贏rchRProject中添加了可接近性信息香缺,我們就可以在繪制browser tracks時(shí)使用它作為loop track手销。我們通過plotBrowserTrack()
函數(shù)的loop
參數(shù)來實(shí)現(xiàn)這一點(diǎn)。這里图张,我們使用getCoAccessibility()
的默認(rèn)參數(shù)锋拖,包括corCutOff = 0.5
、resolution = 1000
和returnLoops = TRUE
:
> markerGenes <- c(
"CD34", #Early Progenitor
"GATA1", #Erythroid
"PAX5", "MS4A1", #B-Cell Trajectory
"CD14", #Monocytes
"CD3D", "CD8A", "TBX21", "IL7R" #TCells
)
> p <- plotBrowserTrack(
ArchRProj = projHeme5,
groupBy = "Clusters2",
geneSymbol = markerGenes,
upstream = 50000,
downstream = 50000,
loops = getCoAccessibility(projHeme5)
)
畫圖:
> grid::grid.newpage()
> grid::grid.draw(p$CD14)
保存圖片:
> plotPDF(plotList = p,
name = "Plot-Tracks-Marker-Genes-with-CoAccessibility.pdf",
ArchRProj = projHeme5,
addDOC = FALSE, width = 5, height = 5)
(三)ArchR的Peak2GeneLinkage分析
這一部分請(qǐng)注意祸轮!你需要安裝最新版的ArchR兽埃,如果你安裝的是1.0.0,這一部分是要報(bào)錯(cuò)的适袜!安裝新版的代碼:
devtools::install_github("GreenleafLab/ArchR", ref="release_1.0.1", repos = BiocManager::repositories())
安裝后為了確定確實(shí)是新版了柄错,檢查一下:
packageVersion("ArchR")
[1] ‘1.0.1’
與共可接近性相似,ArchR也可以鑒定所謂的“peak-to-gene links”苦酱。peak-to-gene links和共可接近性之間的主要區(qū)別是售貌,共可接近性是一種僅ATAC-seq分析,尋找可接近性在兩個(gè)峰之間的相關(guān)性疫萤,而peak-to-gene linkage利用整合的scRNA-seq數(shù)據(jù)來尋找峰可接近性和基因表達(dá)之間的相關(guān)性颂跨。這些代表了一個(gè)類似問題的正交方法。然而扯饶,由于peak-to-gene鏈接與scATAC-seq和scRNA-seq數(shù)據(jù)相關(guān)恒削,我們通常認(rèn)為這些links與基因調(diào)控相互作用更相關(guān)。
為了識(shí)別ArchR中的peak-to-gene links帝际,我們使用addPeak2GeneLinks()
函數(shù):
> projHeme5 <- addPeak2GeneLinks(
ArchRProj = projHeme5,
reducedDims = "IterativeLSI"
)
然后蔓同,我們可以檢索這些peak-to-gene links,其方式類似于使用getPeak2GeneLinks()
函數(shù)檢索協(xié)同可接近性相互作用蹲诀。正如我們?cè)谇懊婵吹降陌吡唬@個(gè)函數(shù)允許用戶指定相關(guān)性的cutoff和linkages的分辨率,首先我們嘗試使用returnLoops = FALSE:
> p2g <- getPeak2GeneLinks(
ArchRProj = projHeme5,
corCutOff = 0.45,
resolution = 1,
returnLoops = FALSE
)
當(dāng)returnLoops
設(shè)置為false時(shí)脯爪,該函數(shù)將返回一個(gè)DataFrame對(duì)象则北,類似于getCoAccessibility()
返回的DataFrame對(duì)象矿微。主要區(qū)別在于scATAC-seq峰的索引存儲(chǔ)在idxATAC
列中,而scRNA-seq基因的索引存儲(chǔ)在idxRNA
列中尚揣。
這個(gè)peak-to-gene linkage的dataframe還有一個(gè)metadata組件涌矢,該組件包含相關(guān)峰的GRanges對(duì)象。上述idxATAC的各項(xiàng)指標(biāo)均適用于該對(duì)象快骗。
> metadata(p2g)[[1]]
GRanges object with 140865 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] chr1 752495-752995 *
[2] chr1 757871-758371 *
[3] chr1 762690-763190 *
[4] chr1 773670-774170 *
[5] chr1 801006-801506 *
... ... ... ...
[140861] chrX 154807254-154807754 *
[140862] chrX 154840907-154841407 *
[140863] chrX 154841994-154842494 *
[140864] chrX 154862043-154862543 *
[140865] chrX 154996991-154997491 *
-------
seqinfo: 23 sequences from an unspecified genome; no seqlengths
如果我們?cè)O(shè)置returnLoops = TRUE
娜庇,那么getPeak2GeneLinks()
將返回一個(gè)loop track GRanges對(duì)象,該對(duì)象連接peak和gene方篮。在共可接近性方面名秀,IRanges對(duì)象的開始和結(jié)束代表了峰和被連接基因的位置。當(dāng)resolution = 1時(shí)藕溅,它將峰的中心連接到基因的單堿基TSS上匕得。
> p2g <- getPeak2GeneLinks(
ArchRProj = projHeme5,
corCutOff = 0.45,
resolution = 1,
returnLoops = TRUE
)
> p2g[[1]]
GRanges object with 151425 ranges and 2 metadata columns:
seqnames ranges strand | value
<Rle> <IRanges> <Rle> | <numeric>
[1] chr1 752745-901877 * | 0.45482431317995
[2] chr1 752745-935552 * | 0.323287626544222
[3] chr1 752745-948847 * | 0.452241013027625
[4] chr1 762902-845576 * | 0.399410417826805
[5] chr1 762902-846795 * | 0.277657378194161
... ... ... ... . ...
[151421] chrX 154299547-154322956 * | 0.339960030862308
[151422] chrX 154299547-154417246 * | 0.184827972853627
[151423] chrX 154299547-154445180 * | 0.234457969710443
[151424] chrX 154322956-154444701 * | 0.272574364121956
[151425] chrX 154444701-154493813 * | 0.262058933222826
FDR
<numeric>
[1] 3.64903693825748e-25
[2] 1.36105300270992e-12
[3] 7.3782745193054e-25
[4] 3.72871411563353e-19
[5] 1.80208567788527e-09
... ...
[151421] 7.03940344317308e-14
[151422] 9.55454380815343e-05
[151423] 5.1496329871895e-07
[151424] 3.70925182517263e-09
[151425] 1.57206595271312e-08
-------
seqinfo: 23 sequences from an unspecified genome; no seqlengths
我們也可以通過設(shè)置resolution = 1000
來降低這些links的分辨率。這主要用于在browser tracks時(shí)繪制links巾表,因?yàn)樵谀承┣闆r下汁掠,許多峰的附近都鏈接到同一基因,這可能很難可視化集币。
> p2g <- getPeak2GeneLinks(
ArchRProj = projHeme5,
corCutOff = 0.45,
resolution = 1000,
returnLoops = TRUE
)
> p2g[[1]]
GRanges object with 37320 ranges and 2 metadata columns:
seqnames ranges strand | value
<Rle> <IRanges> <Rle> | <numeric>
[1] chr1 752500-901500 * | 0.45482431317995
[2] chr1 752500-948500 * | 0.452241013027625
[3] chr1 762500-856500 * | 0.510133486947831
[4] chr1 762500-901500 * | 0.465953111794918
[5] chr1 762500-999500 * | 0.622693553092798
... ... ... ... . ...
[37316] chrX 153881500-153980500 * | 0.686663730276686
[37317] chrX 153941500-153979500 * | 0.461993816785105
[37318] chrX 153946500-153991500 * | 0.493712263022568
[37319] chrX 153980500-153991500 * | 0.54404072487661
[37320] chrX 153991500-154012500 * | 0.454090683118072
FDR
<numeric>
[1] 3.64903693825748e-25
[2] 7.3782745193054e-25
[3] 2.15908329887624e-32
[4] 1.63066517085523e-26
[5] 4.50860678157444e-52
... ...
[37316] 2.08980297150101e-67
[37317] 4.99062691501854e-26
[37318] 4.19132973592781e-30
[37319] 1.54059545203608e-37
[37320] 4.45896733710535e-25
-------
seqinfo: 23 sequences from an unspecified genome; no seqlengths
繼續(xù)降低分辨率考阱,甚至可能會(huì)減少總的peak-to-gene links鑒定數(shù)量。
> p2g <- getPeak2GeneLinks(
ArchRProj = projHeme5,
corCutOff = 0.45,
resolution = 10000,
returnLoops = TRUE
)
> p2g[[1]]
GRanges object with 30123 ranges and 2 metadata columns:
seqnames ranges strand | value
<Rle> <IRanges> <Rle> | <numeric>
[1] chr1 755000-905000 * | 0.45482431317995
[2] chr1 755000-945000 * | 0.452241013027625
[3] chr1 765000-855000 * | 0.510133486947831
[4] chr1 765000-905000 * | 0.465953111794918
[5] chr1 765000-995000 * | 0.622693553092798
... ... ... ... . ...
[30119] chrX 153885000-153985000 * | 0.686663730276686
[30120] chrX 153945000-153975000 * | 0.461993816785105
[30121] chrX 153945000-153995000 * | 0.493712263022568
[30122] chrX 153985000-153995000 * | 0.54404072487661
[30123] chrX 153995000-154015000 * | 0.454090683118072
FDR
<numeric>
[1] 3.64903693825748e-25
[2] 7.3782745193054e-25
[3] 2.15908329887624e-32
[4] 1.63066517085523e-26
[5] 4.50860678157444e-52
... ...
[30119] 2.08980297150101e-67
[30120] 4.99062691501854e-26
[30121] 4.19132973592781e-30
[30122] 1.54059545203608e-37
[30123] 4.45896733710535e-25
-------
seqinfo: 23 sequences from an unspecified genome; no seqlengths
(1)用browser tracks繪制peak-to-gene links
為了將這些peak-to-gene鏈接繪制成browser track惠猿,我們使用了與前一節(jié)中顯示的可接近性相同的分析流程羔砾。這里我們使用plotBrowserTrack()
函數(shù):
> markerGenes <- c(
"CD34", #Early Progenitor
"GATA1", #Erythroid
"PAX5", "MS4A1", #B-Cell Trajectory
"CD14", #Monocytes
"CD3D", "CD8A", "TBX21", "IL7R" #TCells
)
> p <- plotBrowserTrack(
ArchRProj = projHeme5,
groupBy = "Clusters2",
geneSymbol = markerGenes,
upstream = 50000,
downstream = 50000,
loops = getPeak2GeneLinks(projHeme5)
)
畫某一個(gè)基因的peak-to-gene links:
> grid::grid.newpage()
> grid::grid.draw(p$CD14)
保存圖片:
> plotPDF(plotList = p,
name = "Plot-Tracks-Marker-Genes-with-Peak2GeneLinks.pdf",
ArchRProj = projHeme5,
addDOC = FALSE, width = 5, height = 5)
(2)peak-to-gene links分析的熱圖
為了可視化所有peak-to-gene links的對(duì)應(yīng)關(guān)系,我們可以繪制一個(gè)peak-to-gene熱圖偶妖,其中包含兩個(gè)并排的熱圖姜凄,一個(gè)用于我們的scATAC-seq數(shù)據(jù),另一個(gè)用于我們的scRNA-seq數(shù)據(jù)趾访。為此态秧,我們使用plotPeak2GeneHeatmap()
:
> p <- plotPeak2GeneHeatmap(ArchRProj = projHeme5, groupBy = "Clusters2")
> p
根據(jù)傳遞給參數(shù)“k”的值,使用k-means聚類對(duì)heatmap行進(jìn)行聚類扼鞋,參數(shù)“k”默認(rèn)為25申鱼,如下圖所示:
(四)鑒定正向調(diào)控分子
ATAC-seq允許對(duì)在含有DNA結(jié)合motif的位點(diǎn)上染色質(zhì)可接近性發(fā)生巨大變化的TFs進(jìn)行無偏差的鑒定。然而云头,當(dāng)通過位置權(quán)重矩陣(PWMs)進(jìn)行聚合時(shí)捐友,TFs家族(例如GATA因子)在其結(jié)合motif上具有相似的特征。
這種motif相似性使得識(shí)別可能在預(yù)測(cè)結(jié)合位點(diǎn)上引起染色質(zhì)可接近性變化的特定TFs具有挑戰(zhàn)性溃槐。為了規(guī)避這一難題匣砖,我們之前用ATAC-seq和RNA-seq來鑒定TFs的基因表達(dá)與其相應(yīng)motif可接近性的改變呈正相關(guān)。我們稱這些TFs為“正向調(diào)控分子”。然而猴鲫,這種分析依賴于匹配的基因表達(dá)數(shù)據(jù)对人,這可能在所有的實(shí)驗(yàn)中并不容易得到。為了克服這種依賴性拂共,ArchR可以識(shí)別出推斷的基因評(píng)分與其chromVAR TF偏差z-score相關(guān)的TFs牺弄。為了實(shí)現(xiàn)這一點(diǎn),ArchR將低重疊細(xì)胞聚集的TF motif的染色質(zhì)偏差z分?jǐn)?shù)與TF基因活性分?jǐn)?shù)關(guān)聯(lián)起來宜狐。當(dāng)使用與ArchR結(jié)合的scRNA-seq整合時(shí)势告,可以使用TF的基因表達(dá),而不是推斷的基因活性評(píng)分抚恒。
(1)鑒定偏移的TF motif
識(shí)別正向調(diào)控的TF的第一部分是識(shí)別偏移的TF motif培慌。我們?cè)谇耙徽轮羞M(jìn)行了此分析,為所有的motifs創(chuàng)建了一個(gè)chromVAR偏差和偏差z-score的MotifMatrix柑爸。我們可以通過使用getGroupSE()
函數(shù)來獲得這些按cluster平均的數(shù)據(jù),該函數(shù)返回一個(gè)SummarizedExperiment:
> seGroupMotif <- getGroupSE(ArchRProj = projHeme5, useMatrix = "MotifMatrix", groupBy = "Clusters2")
因?yàn)檫@個(gè)SummarizedExperiment
對(duì)象來自MotifMatrix
盒音,所以有兩個(gè)seqname -“deviations”和“z”-對(duì)應(yīng)于原始偏離和來自chromVAR的偏離z-score表鳍。
> seGroupMotif
class: SummarizedExperiment
dim: 1740 11
metadata(0):
assays(1): MotifMatrix
rownames(1740): f1 f2 ... f1739 f1740
rowData names(3): seqnames idx name
colnames(11): B CD4.M ... PreB Progenitor
colData names(20): TSSEnrichment ReadsInTSS ... FRIP nCells
我們可以把這個(gè)SummarizedExperiment取子集化,只取偏差z-score:
> seZ <- seGroupMotif[rowData(seGroupMotif)$seqnames=="z",]
然后我們可以確定所有cluster之間z-score的最大delta祥诽。這將有助于分層motifs的(基于不同cluster之間的偏差程度):
> rowData(seZ)$maxDelta <- lapply(seq_len(ncol(seZ)), function(x){
rowMaxs(assay(seZ) - assay(seZ)[,x])
}) %>% Reduce("cbind", .) %>% rowMaxs
(2)鑒定TF motifs和TF基因分?jǐn)?shù)/表達(dá)的相關(guān)性
為了識(shí)別TFs motif可接近性與其自身基因活性(通過基因得分或基因表達(dá))的相關(guān)性譬圣,我們使用correlateMatrices()
函數(shù)并提供我們感興趣的兩個(gè)矩陣,在本例中為GeneScoreMatrix
和MotifMatrix
雄坪。如前所述厘熟,這些相關(guān)性是在reducedDims
參數(shù)中指定的低維空間中確定的許多低重疊的細(xì)胞集合中確定的:
> corGSM_MM <- correlateMatrices(
ArchRProj = projHeme5,
useMatrix1 = "GeneScoreMatrix",
useMatrix2 = "MotifMatrix",
reducedDims = "IterativeLSI"
)
此函數(shù)返回一個(gè)DataFrame對(duì)象,該對(duì)象包含來自每個(gè)矩陣的元素以及跨低重疊細(xì)胞聚合的相關(guān)性:
> corGSM_MM
DataFrame with 825 rows and 14 columns
GeneScoreMatrix_name MotifMatrix_name cor
<array> <array> <numeric>
1 HES4 HES4_95 0.0964222952027804
2 HES5 HES5_98 0.389610876712439
3 PRDM16 PRDM16_211 0.510854729692758
4 TP73 TP73_705 0.528529747990005
5 TP73-AS1 TP73_705 -0.122421749585615
... ... ... ...
821 TFDP3 TFDP3_309 -0.114905134028301
822 ZNF75D ZNF75D_272 -0.0780911768193981
823 ZIC3 ZIC3_215 0.319287832283537
824 SOX3 SOX3_759 -0.0830413458998048
825 MECP2 MECP2_645 0.143645914605916
padj pval GeneScoreMatrix_seqnames
<numeric> <numeric> <Rle>
1 1 0.0330289085440269 chr1
2 2.91738527794111e-16 3.56648566985466e-19 chr1
3 6.00864881004139e-31 7.34553644259338e-34 chr1
4 1.26528717256258e-33 1.54680583442858e-36 chr1
5 1 0.0067196460862283 chr1
... ... ... ...
821 1 0.0109948840206202 chrX
822 1 0.0845143633274378 chrX
823 3.87543709985036e-10 4.73769816607624e-13 chrX
824 1 0.0665364665390267 chrX
825 1 0.00144806861367396 chrX
GeneScoreMatrix_start GeneScoreMatrix_end GeneScoreMatrix_strand
<array> <array> <array>
1 935552 934342 2
2 2461684 2460184 2
3 2985742 3355185 1
4 3569129 3652765 1
5 3663937 3652548 2
... ... ... ...
821 132352376 132350697 2
822 134429965 134419723 2
823 136648346 136654259 1
824 139587225 139585152 2
825 153363188 153287264 2
GeneScoreMatrix_idx GeneScoreMatrix_matchName MotifMatrix_seqnames
<array> <array> <Rle>
1 15 HES4 z
2 74 HES5 z
3 82 PRDM16 z
4 89 TP73 z
5 90 TP73 z
... ... ... ...
821 697 TFDP3 z
822 728 ZNF75D z
823 753 ZIC3 z
824 765 SOX3 z
825 874 MECP2 z
MotifMatrix_idx MotifMatrix_matchName
<array> <array>
1 95 HES4
2 98 HES5
3 211 PRDM16
4 705 TP73
5 705 TP73
... ... ...
821 309 TFDP3
822 272 ZNF75D
823 215 ZIC3
824 759 SOX3
825 645 MECP2
我們可以使用GeneIntegrationMatrix
代替GeneScoreMatrix
進(jìn)行同樣的分析:
> corGIM_MM <- correlateMatrices(
ArchRProj = projHeme5,
useMatrix1 = "GeneIntegrationMatrix",
useMatrix2 = "MotifMatrix",
reducedDims = "IterativeLSI"
)
> corGIM_MM
DataFrame with 798 rows and 14 columns
GeneIntegrationMatrix_name MotifMatrix_name cor
<array> <array> <numeric>
1 HES4 HES4_95 -0.171384934997311
2 HES5 HES5_98 -0.169840367432083
3 PRDM16 PRDM16_211 0.252222396399267
4 TP73 TP73_705 -0.254496075318124
5 HES2 HES2_19 -0.320907300946354
... ... ... ...
794 TFDP3 TFDP3_309 NA
795 ZNF75D ZNF75D_272 0.281512276845826
796 ZIC3 ZIC3_215 NA
797 SOX3 SOX3_759 NA
798 MECP2 MECP2_645 0.189446439244001
padj pval GeneIntegrationMatrix_seqnames
<numeric> <numeric> <Rle>
1 0.0745474147720114 0.000139863817583511 chr1
2 0.0857722946054064 0.000160923629653671 chr1
3 8.31213157194671e-06 1.55949935683803e-08 chr1
4 6.10714657063681e-06 1.14580611081366e-08 chr1
5 1.89680123127737e-10 3.55872651271552e-13 chr1
... ... ... ...
794 NA NA chrX
795 1.24170853148747e-07 2.32965953374759e-10 chrX
796 NA NA chrX
797 NA NA chrX
798 0.0132063045846304 2.47773069130027e-05 chrX
GeneIntegrationMatrix_start GeneIntegrationMatrix_end
<array> <array>
1 935552 934342
2 2461684 2460184
3 2985742 3355185
4 3569129 3652765
5 6484730 6472498
... ... ...
794 132352376 132350697
795 134429965 134419723
796 136648346 136654259
797 139587225 139585152
798 153363188 153287264
GeneIntegrationMatrix_strand GeneIntegrationMatrix_idx
<array> <array>
1 2 8
2 2 53
3 1 59
4 1 64
5 2 81
... ... ...
794 2 562
795 2 576
796 1 595
797 2 602
798 2 680
GeneIntegrationMatrix_matchName MotifMatrix_seqnames MotifMatrix_idx
<array> <Rle> <array>
1 HES4 z 95
2 HES5 z 98
3 PRDM16 z 211
4 TP73 z 705
5 HES2 z 19
... ... ... ...
794 TFDP3 z 309
795 ZNF75D z 272
796 ZIC3 z 215
797 SOX3 z 759
798 MECP2 z 645
MotifMatrix_matchName
<array>
1 HES4
2 HES5
3 PRDM16
4 TP73
5 HES2
... ...
794 TFDP3
795 ZNF75D
796 ZIC3
797 SOX3
798 MECP2
(3)添加Maximum Delta Deviation到相關(guān)性Dataframe
對(duì)于每一個(gè)相關(guān)分析维哈,我們可以用我們?cè)诓襟E1中計(jì)算出的cluster間觀察到的最大delta注釋每個(gè)motif绳姨。
> corGSM_MM$maxDelta <- rowData(seZ)[match(corGSM_MM$MotifMatrix_name, rowData(seZ)$name), "maxDelta"]
> corGIM_MM$maxDelta <- rowData(seZ)[match(corGIM_MM$MotifMatrix_name, rowData(seZ)$name), "maxDelta"]
(4)鑒定正向調(diào)控TF
我們可以利用所有這些信息來鑒定正向調(diào)控的TF。在下面的例子中阔挠,我們認(rèn)為motif與基因評(píng)分(或基因表達(dá))的相關(guān)性大于0.5飘庄,且調(diào)整后的p值小于0.01,且偏差z-score的聚類間最大差異位于前四分位數(shù)的TFs為正調(diào)控因子购撼。
我們應(yīng)用這些標(biāo)準(zhǔn)來進(jìn)行選擇跪削,并做一些文本處理來提取TF名稱:
> corGSM_MM <- corGSM_MM[order(abs(corGSM_MM$cor), decreasing = TRUE), ]
> corGSM_MM <- corGSM_MM[which(!duplicated(gsub("\\-.*","",corGSM_MM[,"MotifMatrix_name"]))), ]
> corGSM_MM$TFRegulator <- "NO"
> corGSM_MM$TFRegulator[which(corGSM_MM$cor > 0.5 & corGSM_MM$padj < 0.01 & corGSM_MM$maxDelta > quantile(corGSM_MM$maxDelta, 0.75))] <- "YES"
> sort(corGSM_MM[corGSM_MM$TFRegulator=="YES",1])
[1] "ATOH1" "BCL11A" "CEBPA-DT" "CEBPB" "CEBPD" "CREB1"
[7] "CREB3L4" "EBF1" "EGR2" "EOMES" "ERF" "ESR1"
[13] "ETS1" "FUBP1" "GABPA" "GATA1" "GATA2" "GATA5"
[19] "GATA6" "IRF1" "JDP2" "KLF11" "KLF2" "LYL1"
[25] "MECOM" "MITF" "MYOD1" "NFE2" "NFIA" "NFIC"
[31] "NFIX" "PAX5" "POU2F1" "PRDM16" "RUNX2" "SIX5"
[37] "SMAD1" "SP4" "SPI1" "SPIB" "TAL1" "TCF15"
[43] "TCF23" "TFAP2C" "TWIST1" "YY1"
從基因分?jǐn)?shù)和motif偏差z-score中識(shí)別出這些正向調(diào)控的TF,我們可以在點(diǎn)圖中突出它們迂求。
> p <- ggplot(data.frame(corGSM_MM), aes(cor, maxDelta, color = TFRegulator)) +
geom_point() +
theme_ArchR() +
geom_vline(xintercept = 0, lty = "dashed") +
scale_color_manual(values = c("NO"="darkgrey", "YES"="firebrick3")) +
xlab("Correlation To Gene Score") +
ylab("Max TF Motif Delta") +
scale_y_continuous(
expand = c(0,0),
limits = c(0, max(corGSM_MM$maxDelta)*1.05)
)
> p
我們可以對(duì)從GeneIntegrationMatrix
中得到的相關(guān)進(jìn)行同樣的分析:
> corGIM_MM <- corGIM_MM[order(abs(corGIM_MM$cor), decreasing = TRUE), ]
> corGIM_MM <- corGIM_MM[which(!duplicated(gsub("\\-.*","",corGIM_MM[,"MotifMatrix_name"]))), ]
> corGIM_MM$TFRegulator <- "NO"
> corGIM_MM$TFRegulator[which(corGIM_MM$cor > 0.5 & corGIM_MM$padj < 0.01 & corGIM_MM$maxDelta > quantile(corGIM_MM$maxDelta, 0.75))] <- "YES"
> sort(corGIM_MM[corGIM_MM$TFRegulator=="YES",1])
[1] "BACH1" "CEBPA" "CEBPB" "CEBPD" "CTCF" "EBF1" "EOMES"
[8] "ETS1" "FOS" "FOSB" "FOSL1" "FOSL2" "GATA1" "GATA2"
[15] "IRF1" "IRF2" "IRF9" "JDP2" "KLF2" "MITF" "NFE2"
[22] "NFIA" "NFIB" "NFIC" "NFIX" "NFKB2" "NR4A1" "NRF1"
[29] "PAX5" "RARA" "RUNX1" "SP4" "SPI1" "STAT2" "TCF3"
[36] "TCF4" "TGIF2"
> p <- ggplot(data.frame(corGIM_MM), aes(cor, maxDelta, color = TFRegulator)) +
geom_point() +
theme_ArchR() +
geom_vline(xintercept = 0, lty = "dashed") +
scale_color_manual(values = c("NO"="darkgrey", "YES"="firebrick3")) +
xlab("Correlation To Gene Expression") +
ylab("Max TF Motif Delta") +
scale_y_continuous(
expand = c(0,0),
limits = c(0, max(corGIM_MM$maxDelta)*1.05)
)
> p