Signac是Seurat的一個擴(kuò)展功能R包胧沫,可以用來分析昌简、解釋和探索
單細(xì)胞染色質(zhì)數(shù)據(jù)集
占业。目前,Signac包主要專注于分析單細(xì)胞ATAC-seq數(shù)據(jù)
纯赎,之后還會添加其他基于染色質(zhì)調(diào)控的單細(xì)胞測序數(shù)據(jù)分析的新功能谦疾。
Signac包主要支持以下功能:
- 單細(xì)胞數(shù)據(jù)質(zhì)控指標(biāo)的計(jì)算
- 數(shù)據(jù)降維、可視化和細(xì)胞聚類
- 細(xì)胞類型特異性peak的鑒定
- “pseudo-bulk” coverage tracks的可視化
- 多樣本scATAC-seq數(shù)據(jù)集的整合
- 與scRNA-seq數(shù)據(jù)集的整合
- Motif富集分析
本教程中犬金,我們將學(xué)習(xí)處理10x Genomics平臺測序產(chǎn)生的人外周血單核細(xì)胞(PBMC)的scATAC-seq數(shù)據(jù)念恍,原始的示例數(shù)據(jù)可以從10x Genomics官網(wǎng)進(jìn)行下載:
- The Raw data
- The Metadata
- The fragments file
- The fragments file index
安裝Signac及其依賴包
# 安裝bioconductor包
# Install bioconductor
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install()
# 安裝Signac包
# Current release
install.packages("Signac")
# Development version
install.packages("devtools")
devtools::install_github("timoast/signac", ref = "develop")
#安裝參考基因組的注釋信息
# Human
BiocManager::install(c('BSgenome.Hsapiens.UCSC.hg19', 'EnsDb.Hsapiens.v75'))
# Mouse
BiocManager::install(c('BSgenome.Mmusculus.UCSC.mm10', 'EnsDb.Mmusculus.v79'))
加載所需的R包
library(Signac)
library(Seurat)
library(GenomeInfoDb)
library(EnsDb.Hsapiens.v75)
library(ggplot2)
library(patchwork)
set.seed(1234)
數(shù)據(jù)加載與預(yù)處理
Signac主要讀取兩個輸入文件,它們可以用CellRanger處理產(chǎn)生:
Peak/Cell matrix.
它類似于用于分析單細(xì)胞RNA-seq的基因表達(dá)計(jì)數(shù)矩陣晚顷。但是峰伙,矩陣中的每一行代表一個基因組區(qū)域(“peak region”)而不是基因,表示開放的染色質(zhì)區(qū)域该默。矩陣中的每個值代表在每個峰內(nèi)映射的每個條形碼(即細(xì)胞)的Tn5切割位點(diǎn)的數(shù)量瞳氓,我們可以在10X Genomics官網(wǎng)上找到更多詳細(xì)的信息。Fragment file.
這個文件存儲了所有單細(xì)胞中唯一片段(unique fragments)的完整列表栓袖。該文件較大匣摘,使用起來也較慢,并且存儲在磁盤上(而不是內(nèi)存中)叽赊。但是恋沃,保留此文件的好處是它包含了與每個細(xì)胞關(guān)聯(lián)的所有片段,與僅映射到peaks的reads數(shù)相反必指。
對于大多數(shù)分析囊咏,我們僅使用peak/cell matrix矩陣即可,但對于某些分析(如塔橡,創(chuàng)建“Gene Activity Matrix”)梅割,我們發(fā)現(xiàn)僅使用peaks映射的reads數(shù)可能會對結(jié)果產(chǎn)生不利的影響。因此葛家,在本教程中户辞,我們同時使用了這兩個文件。首先癞谒,我們使用peak/cell matrix矩陣創(chuàng)建Seurat對象底燎,然后將片段文件(fragment file)的路徑存儲在磁盤上的Seurat對象中:
# 讀取peak/cell matrix矩陣
counts <- Read10X_h5(filename = "/home/dongwei/scATAC-seq/PBMC/atac_v1_pbmc_10k_filtered_peak_bc_matrix.h5")
# 查看counts信息
head(counts)
6 x 9277 sparse Matrix of class "dgCMatrix"
[[ suppressing 25 column names 'AAACGAAAGAGCGAAA-1', 'AAACGAAAGAGTTTGA-1', 'AAACGAAAGCGAGCTA-1' ... ]]
chr1:565113-565543 . . . . . . . . . . . . . . . . . . . . . .
chr1:569179-569635 . . . . . . . . . . . . . . . . . . . . . .
chr1:713534-714806 . . 2 8 . 2 2 . . 2 . . . . . 4 . . . . . .
chr1:752436-753020 . . . . . . . . . . . . . . . . . . 1 . . .
chr1:762144-763353 . . 4 2 . . . 2 . . . . . . 2 . 2 . 4 . . .
chr1:779610-780252 . . . . . . . . . . . . . . . . . . . . . .
chr1:565113-565543 . . . ......
chr1:569179-569635 . . . ......
chr1:713534-714806 2 . . ......
chr1:752436-753020 . 2 . ......
chr1:762144-763353 . 2 2 ......
chr1:779610-780252 . . 1 ......
.....suppressing 9252 columns in show(); maybe adjust 'options(max.print= *, width = *)'
..............................
# 查看peaks和細(xì)胞的數(shù)目
dim(counts)
[1] 80234 9277
# 讀取細(xì)胞注釋信息
metadata <- read.csv(file = "/home/dongwei/scATAC-seq/PBMC/atac_v1_pbmc_10k_singlecell.csv",
header = TRUE,
row.names = 1
)
# 查看metadata信息
names(metadata)
[1] "total"
[2] "duplicate"
[3] "chimeric"
[4] "unmapped"
[5] "lowmapq"
[6] "mitochondrial"
[7] "passed_filters"
[8] "cell_id"
[9] "is__cell_barcode"
[10] "TSS_fragments"
[11] "DNase_sensitive_region_fragments"
[12] "enhancer_region_fragments"
[13] "promoter_region_fragments"
[14] "on_target_fragments"
[15] "blacklist_region_fragments"
[16] "peak_region_fragments"
[17] "peak_region_cutsites"
head(metadata)[,1:5]
total duplicate chimeric unmapped lowmapq
NO_BARCODE 9034214 3902029 106211 1424804 648333
AAACGAAAGAAACGCC-1 3 1 0 1 0
AAACGAAAGAAAGCAG-1 15 0 0 5 3
AAACGAAAGAAAGGGT-1 8 1 0 2 0
AAACGAAAGAAATACC-1 9908 5343 120 121 1146
AAACGAAAGAAATCTG-1 2 0 0 0 0
# 構(gòu)建Seurat對象
pbmc <- CreateSeuratObject(
counts = counts,
assay = 'peaks',
project = 'ATAC',
min.cells = 1,
meta.data = metadata
)
pbmc
An object of class Seurat
79921 features across 9277 samples within 1 assay
Active assay: peaks (79921 features, 0 variable features)
# 指定fragment file路徑
fragment.path <- '/home/dongwei/scATAC-seq/PBMC/atac_v1_pbmc_10k_fragments.tsv.gz'
# 使用SetFragments函數(shù)設(shè)置fragment file路徑
pbmc <- SetFragments(
object = pbmc,
file = fragment.path
)
# 查看Seurat對象
pbmc
An object of class Seurat
79921 features across 9277 samples within 1 assay
Active assay: peaks (79921 features, 0 variable features)
Optional step: Filtering the fragment file
To make downstream steps that use this file faster, we can filter the fragments file to contain only reads from cells that we retain in the analysis. This is optional, but slow, and only needs to be performed once. Running this command writes a new file to disk and indexes the file so it is ready to be used by Signac.
# 指定過濾后的fragment file
fragment_file_filtered <- "/home/dongwei/scATAC-seq/PBMC/atac_v1_pbmc_10k_filtered_fragments.tsv"
# 使用FilterFragments函數(shù)對fragment file進(jìn)行過濾
FilterFragments(
fragment.path = fragment.path,
cells = colnames(pbmc),
output.path = fragment_file_filtered
)
Retaining 9277 cells
Reading fragments
|--------------------------------------------------|
|==================================================|
Sorting fragments
Writing output
Compressing output
Building index
# 使用過濾后的fragment file重新設(shè)置路徑
pbmc <- SetFragments(object = pbmc, file = paste0(fragment_file_filtered, '.bgz'))
計(jì)算QC指標(biāo)進(jìn)行數(shù)據(jù)質(zhì)控
目前,對于scATAC-seq測序數(shù)據(jù)弹砚,我們建議使用以下QC指標(biāo)來評估數(shù)據(jù)質(zhì)量双仍。與scRNA-seq數(shù)據(jù)一樣,這些參數(shù)的預(yù)期值范圍將取決于我們的生物系統(tǒng)桌吃,細(xì)胞生存力等朱沃。
- Nucleosome banding pattern(核小體條帶模式):繪制片段大小(fragment sizes)的直方圖(由配對末端測序讀數(shù)確定)來展示核小體的條帶模式。我們計(jì)算每個單細(xì)胞逗物,并量化單核小體與無核小體片段(存儲為nucleosome_signal)的近似比率搬卒。
-
Transcriptional start site (TSS) enrichment score(轉(zhuǎn)錄起始位點(diǎn)富集得分):
ENCODE
項(xiàng)目已根據(jù)以TSS為中心的片段與TSS側(cè)翼區(qū)域中的片段的比率定義了ATAC-seq定位得分(請參閱https://www.encodeproject.org/data-standards/terms/)。較差的ATAC-seq實(shí)驗(yàn)通常將具有較低的TSS富集得分翎卓。我們可以使用TSSEnrichment
函數(shù)為每個細(xì)胞計(jì)算該指標(biāo)契邀,并將結(jié)果存儲在元數(shù)據(jù)中,列名稱為TSS.enrichment莲祸。 - Total number of fragments in peaks(peaks中片段的總數(shù)):細(xì)胞測序深度/復(fù)雜性的量度蹂安。由于測序深度低,可能需要排除reads數(shù)很少的細(xì)胞锐帜,reads數(shù)極高的細(xì)胞可能代表了doublets或核團(tuán)塊。
- Fraction of fragments in peaks(peaks中片段的比例):表示落在ATAC-seq peaks內(nèi)的總片段比例畜号。比例較低的細(xì)胞(即<15-20%)通常代表應(yīng)刪除的低質(zhì)量細(xì)胞或技術(shù)誤差缴阎。
-
Ratio reads in ‘blacklist’ sites(比對到“blacklist”區(qū)域的reads比率):
ENCODE
項(xiàng)目提供了一個“blacklist“”區(qū)域列表,這些區(qū)域表示通常與人為信號關(guān)聯(lián)的讀取简软。reads比對到這些區(qū)域的比例較高的細(xì)胞(與reads比對到peaks區(qū)域比例相比)通常代表了技術(shù)誤差蛮拔,應(yīng)將其刪除。Signac包中包含了針對人類(hg19和GRCh38)痹升,小鼠(mm10)建炫,果蠅(dm3)和秀麗隱桿線蟲(ce10)的ENCODE“blacklist”區(qū)域。
# 使用NucleosomeSignal函數(shù)計(jì)算Nucleosome banding pattern
pbmc <- NucleosomeSignal(object = pbmc)
# 計(jì)算peaks中reads的比例
pbmc$pct_reads_in_peaks <- pbmc$peak_region_fragments / pbmc$passed_filters * 100
# 計(jì)算比對到“blacklist”區(qū)域的reads比率
pbmc$blacklist_ratio <- pbmc$blacklist_region_fragments / pbmc$peak_region_fragments
# 質(zhì)控指標(biāo)的可視化
p1 <- VlnPlot(pbmc, c('pct_reads_in_peaks', 'peak_region_fragments'), pt.size = 0.1)
p2 <- VlnPlot(pbmc, c('blacklist_ratio', 'nucleosome_signal'), pt.size = 0.1) & scale_y_log10()
p1 | p2
我們還可以查看所有細(xì)胞的片段長度的周期性疼蛾,并按具有高或低的核小體信號強(qiáng)度對細(xì)胞進(jìn)行分組肛跌。可以發(fā)現(xiàn)察郁,與單核小體/無核小體比率不同的細(xì)胞具有不同的核小體條帶模式衍慎。
# 根據(jù)核小體條帶信號強(qiáng)度進(jìn)行分組
pbmc$nucleosome_group <- ifelse(pbmc$nucleosome_signal > 10, 'NS > 10', 'NS < 10')
# 使用FragmentHistogram函數(shù)可視化核小體條帶分布模式
FragmentHistogram(object = pbmc, group.by = 'nucleosome_group')
Tn5整合事件在轉(zhuǎn)錄起始位點(diǎn)(TSS)的富集也是評估ATAC-seq實(shí)驗(yàn)中Tn5靶向的重要質(zhì)量控制指標(biāo)。ENCODE聯(lián)盟將TSS富集分?jǐn)?shù)定義為將TSS周圍的Tn5整合位點(diǎn)數(shù)量標(biāo)準(zhǔn)化為側(cè)翼區(qū)域中Tn5整合位點(diǎn)的數(shù)量皮钠。有關(guān)TSS富集得分的更多信息稳捆,請參閱ENCODE文檔(https://www.encodeproject.org/data-standards/terms/)。我們可以使用Signac包中的TSSEnrichment
函數(shù)來計(jì)算每個細(xì)胞的TSS富集得分麦轰。
# create granges object with TSS positions
gene.ranges <- genes(EnsDb.Hsapiens.v75)
seqlevelsStyle(gene.ranges) <- 'UCSC'
gene.ranges <- gene.ranges[gene.ranges$gene_biotype == 'protein_coding', ]
gene.ranges <- keepStandardChromosomes(gene.ranges, pruning.mode = 'coarse')
tss.ranges <- GRanges(
seqnames = seqnames(gene.ranges),
ranges = IRanges(start = start(gene.ranges), width = 2),
strand = strand(gene.ranges)
)
seqlevelsStyle(tss.ranges) <- 'UCSC'
tss.ranges <- keepStandardChromosomes(tss.ranges, pruning.mode = 'coarse')
# to save time use the first 2000 TSSs
# 使用TSSEnrichment函數(shù)計(jì)算TSS富集得分
pbmc <- TSSEnrichment(object = pbmc, tss.positions = tss.ranges[1:2000])
pbmc$high.tss <- ifelse(pbmc$TSS.enrichment > 2, 'High', 'Low')
# 使用TSSPlot函數(shù)對TSS富集得分進(jìn)行可視化
TSSPlot(pbmc, group.by = 'high.tss') + ggtitle("TSS enrichment score") + NoLegend()
# Finally we remove cells that are outliers for these QC metrics.
# 根據(jù)不同QC指標(biāo)進(jìn)行數(shù)據(jù)過濾
pbmc <- subset(
x = pbmc,
subset = peak_region_fragments > 3000 &
peak_region_fragments < 20000 &
pct_reads_in_peaks > 15 &
blacklist_ratio < 0.05 &
nucleosome_signal < 10 &
TSS.enrichment > 2
)
pbmc
## An object of class Seurat
## 89307 features across 6978 samples within 1 assay
## Active assay: peaks (89307 features, 0 variable features)
數(shù)據(jù)歸一化與線性降維
Normalization(數(shù)據(jù)歸一化): Signac包使用
frequency-inverse document frequency (TF-IDF)
方法進(jìn)行數(shù)據(jù)歸一化處理乔夯。它將進(jìn)行兩步歸一化的過程,既可以跨細(xì)胞進(jìn)行歸一化以校正細(xì)胞測序深度的差異款侵,也可以跨峰(peaks)進(jìn)行歸一化為更罕見的峰提供更高的值末荐。Feature selection(特征選擇): 正如我們對scRNA-seq數(shù)據(jù)所做的那樣,scATAC-seq數(shù)據(jù)的大部分二進(jìn)制性質(zhì)(binary nature)使其難以執(zhí)行“可變”特征選擇喳坠。取而代之的是鞠评,我們可以選擇僅使用前n%個特征(峰)進(jìn)行數(shù)據(jù)降維,或者使用
FindTopFeatures
函數(shù)刪除少于在n個細(xì)胞中出現(xiàn)的峰壕鹉。這里剃幌,我們將使用所有的peaks聋涨,盡管我們注意到,僅使用部分features(嘗試將min.cutoff設(shè)置為“q75”以使用前25%的峰)會看到非常相似的結(jié)果负乡,并且運(yùn)行速度更快牍白。通過此功能,用于數(shù)據(jù)降維的特征將自動被存儲為Seurat對象中的VariableFeatures抖棘。Dimensional reduction(數(shù)據(jù)降維): 接下來茂腥,我們使用上面選擇的特征(峰)在TD-IDF歸一化矩陣上運(yùn)行奇異值分解(SVD),最后返回對象的低維數(shù)據(jù)表示結(jié)果(對于熟悉scRNA-seq數(shù)據(jù)的用戶切省,您可以認(rèn)為這類似于PCA的輸出)最岗。
# 使用RunTFIDF函數(shù)進(jìn)行數(shù)據(jù)歸一化
pbmc <- RunTFIDF(pbmc)
# 使用FindTopFeatures函數(shù)選擇可變的features
pbmc <- FindTopFeatures(pbmc, min.cutoff = 'q0')
# 使用RunSVD函數(shù)進(jìn)行線性降維
pbmc <- RunSVD(
object = pbmc,
assay = 'peaks',
reduction.key = 'LSI_',
reduction.name = 'lsi'
)
線性降維后,第一個LSI成分通常捕獲測序深度(技術(shù)差異)朝捆,而不是生物學(xué)差異般渡。在這種情況下,應(yīng)從下游分析中刪除該成分芙盘。我們可以使用DepthCor
函數(shù)評估每個LSI成分與測序深度之間的相關(guān)性:
DepthCor(pbmc)
從上圖中驯用,我們可以看到第一個LSI成分與細(xì)胞中counts的總數(shù)之間有非常強(qiáng)的相關(guān)性,因此我們將在后續(xù)的分析中刪除此成分儒老。
非線性降維與細(xì)胞聚類
數(shù)據(jù)線性降維后蝴乔,我們可以使用通常用于分析scRNA-seq數(shù)據(jù)的方法來對細(xì)胞進(jìn)行基于圖的聚類,并進(jìn)行非線性降維(如UMAP等)可視化驮樊。其中薇正,函數(shù)RunUMAP
,FindNeighbors
和FindClusters
都來自Seurat包巩剖。
# 使用RunUMAP函數(shù)進(jìn)行UMAP非線性降維
pbmc <- RunUMAP(object = pbmc, reduction = 'lsi', dims = 2:30)
# 對細(xì)胞執(zhí)行基于圖的聚類
pbmc <- FindNeighbors(object = pbmc, reduction = 'lsi', dims = 2:30)
pbmc <- FindClusters(object = pbmc, verbose = FALSE, algorithm = 3)
# 使用DimPlot函數(shù)進(jìn)行數(shù)據(jù)可視化
DimPlot(object = pbmc, label = TRUE) + NoLegend()
創(chuàng)建基因表達(dá)活性矩陣(gene activity matrix)
從上圖UMAP降維可視化的結(jié)果中可以看出铝穷,在人的外周血中存在多個細(xì)胞類群。如果我們還對PBMC的scRNA-seq分析結(jié)果熟悉的話佳魔,甚至還可以在scATAC-seq數(shù)據(jù)中發(fā)現(xiàn)某些髓樣和淋巴樣細(xì)胞群體的存在曙聂。但是,在scATAC-seq數(shù)據(jù)中注釋和解釋不同的細(xì)胞簇更具有挑戰(zhàn)性鞠鲜,因?yàn)槲覀儗Ψ蔷幋a基因組區(qū)域功能的了解遠(yuǎn)少于對蛋白質(zhì)編碼區(qū)域(基因)的了解宁脊。
但是,我們可以嘗試通過評估與每個基因相關(guān)的染色質(zhì)可及性來量化基因組中每個基因的表達(dá)活性贤姆,并創(chuàng)建一個基于scATAC-seq數(shù)據(jù)的新基因活性測定方法榆苞。在這里,我們將使用一種簡單的方法來對與基因體和啟動子區(qū)域相交的reads數(shù)進(jìn)行求和(我們也建議您探索Cicero
工具霞捡,它可以實(shí)現(xiàn)類似的目標(biāo))坐漏。
為了創(chuàng)建基因表達(dá)活性矩陣(gene activity matrix),我們從EnsembleDB中提取了人類基因組的基因坐標(biāo),并將其擴(kuò)展到TSS上游2kb區(qū)域(因?yàn)閱幼訁^(qū)域的可及性通常與基因表達(dá)相關(guān))赊琳。然后街夭,我們使用FeatureMatrix
函數(shù)計(jì)算映射到這些區(qū)域的每個細(xì)胞的片段數(shù)。這將獲取任何一組基因組坐標(biāo)躏筏,計(jì)算與每個細(xì)胞中的這些坐標(biāo)相交的reads次數(shù)板丽,并返回一個稀疏矩陣。
# Extend coordinates upstream to include the promoter
genebodyandpromoter.coords <- Extend(x = gene.ranges, upstream = 2000, downstream = 0)
# create a gene by cell matrix
gene.activities <- FeatureMatrix(
fragments = fragment.path,
features = genebodyandpromoter.coords,
cells = colnames(pbmc),
chunk = 20
)
# convert rownames from chromsomal coordinates into gene names
gene.key <- genebodyandpromoter.coords$gene_name
names(gene.key) <- GRangesToString(grange = genebodyandpromoter.coords)
rownames(gene.activities) <- gene.key[rownames(gene.activities)]
# add the gene activity matrix to the Seurat object as a new assay, and normalize it
pbmc[['RNA']] <- CreateAssayObject(counts = gene.activities)
pbmc <- NormalizeData(
object = pbmc,
assay = 'RNA',
normalization.method = 'LogNormalize',
scale.factor = median(pbmc$nCount_RNA)
)
現(xiàn)在趁尼,我們可以對一些典型的marker基因的“活性”進(jìn)行可視化展示埃碱,以幫助我們解釋scATAC-seq數(shù)據(jù)聚類分出的不同細(xì)胞簇。請注意酥泞,該基因“活性”會比scRNA-seq數(shù)據(jù)測量值的噪音大得多砚殿。這是因?yàn)樗鼈兇砹藖碜韵∈枞旧|(zhì)數(shù)據(jù)的預(yù)測值,而且還因?yàn)樗鼈兗俣嘶蝮w/啟動子可及性與基因表達(dá)之間的一般對應(yīng)關(guān)系芝囤,而實(shí)際情況并非總是如此瓮具。盡管如此,我們?nèi)钥梢曰诖藬?shù)據(jù)識別出單核細(xì)胞凡人,B細(xì)胞,T細(xì)胞和NK細(xì)胞的不同類群。
DefaultAssay(pbmc) <- 'RNA'
FeaturePlot(
object = pbmc,
features = c('MS4A1', 'CD3D', 'LEF1', 'NKG7', 'TREM1', 'LYZ'),
pt.size = 0.1,
max.cutoff = 'q95',
ncol = 3
)
與scRNA-seq數(shù)據(jù)進(jìn)行整合
為了幫助解釋scATAC-seq測序數(shù)據(jù),我們還可以基于來自同一生物系統(tǒng)(人PBMC)的scRNA-seq實(shí)驗(yàn)數(shù)據(jù)對細(xì)胞進(jìn)行分類铛纬。在這里幼驶,我們利用跨模式整合和標(biāo)簽傳輸?shù)姆椒ǎ瑢cATAC-seq數(shù)據(jù)和scRNA-seq數(shù)據(jù)進(jìn)行整合怔檩。我們基于scATAC-seq數(shù)據(jù)的基因活性矩陣和scRNA-seq數(shù)據(jù)集的基因表達(dá)矩陣識別它們之間共享的相關(guān)模式,以鑒定兩種模式下匹配的生物學(xué)狀態(tài)。該過程根據(jù)每個scRNA-seq定義的簇標(biāo)簽返回每個細(xì)胞的分類評分启上。
# Load the pre-processed scRNA-seq data for PBMCs
# 加載預(yù)處理的PBMC的scRNA-seq數(shù)據(jù)
pbmc_rna <- readRDS("/home/dongwei/scRNA-seq/pbmc_10k_v3.rds")
# 使用FindTransferAnchors函數(shù)識別整合的anchors
transfer.anchors <- FindTransferAnchors(
reference = pbmc_rna,
query = pbmc,
reduction = 'cca'
)
# 使用TransferData函數(shù)基于識別出的anchors進(jìn)行數(shù)據(jù)映射整合
predicted.labels <- TransferData(
anchorset = transfer.anchors,
refdata = pbmc_rna$celltype,
weight.reduction = pbmc[['lsi']],
dims = 2:30
)
pbmc <- AddMetaData(object = pbmc, metadata = predicted.labels)
# 數(shù)據(jù)可視化
plot1 <- DimPlot(
object = pbmc_rna,
group.by = 'celltype',
label = TRUE,
repel = TRUE) + NoLegend() + ggtitle('scRNA-seq')
plot2 <- DimPlot(
object = pbmc,
group.by = 'predicted.id',
label = TRUE,
repel = TRUE) + NoLegend() + ggtitle('scATAC-seq')
plot1 + plot2
pbmc <- subset(pbmc,idents = 14, invert = TRUE)
pbmc <- RenameIdents(
object = pbmc,
'0' = 'CD14 Mono',
'1' = 'CD4 Memory',
'2' = 'CD8 Effector',
'3' = 'CD4 Naive',
'4' = 'CD14 Mono',
'5' = 'CD8 Naive',
'6' = 'DN T',
'7' = 'NK CD56Dim',
'8' = 'pre-B',
'9' = 'CD16 Mono',
'10' = 'pro-B',
'11' = 'DC',
'12' = 'NK CD56bright',
'13' = 'pDC'
)
鑒定不同聚類群之間差異可及性peaks區(qū)域
為了鑒定出不同細(xì)胞簇之間的差異可及性區(qū)域,我們可以進(jìn)行差異可及性(DA)檢驗(yàn)分析店印,并使用不同的函數(shù)進(jìn)行可視化展示冈在。正如Ntranos等人所建議的,我們使用邏輯回歸模型(LR)進(jìn)行DA分析按摘。
# switch back to working with peaks instead of gene activities
DefaultAssay(pbmc) <- 'peaks'
# 使用FindMarkers函數(shù)進(jìn)行差異分析
da_peaks <- FindMarkers(
object = pbmc,
ident.1 = "CD4 Naive",
ident.2 = "CD14 Mono",
min.pct = 0.2,
test.use = 'LR',
latent.vars = 'peak_region_fragments'
)
# 查看差異分析的結(jié)果
head(da_peaks)
## p_val avg_logFC pct.1 pct.2 p_val_adj
## chr14:99721608-99741934 0.000000e+00 0.7794737 0.847 0.024 0.000000e+00
## chr7:142501666-142511108 2.470763e-273 0.7166266 0.750 0.030 2.206565e-268
## chr17:80084198-80086094 1.971072e-263 1.1182759 0.666 0.007 1.760306e-258
## chr14:99695477-99720910 1.285851e-261 0.7604066 0.781 0.023 1.148355e-256
## chr2:113581628-113594911 3.004971e-238 -0.8896770 0.050 0.694 2.683649e-233
## chr6:44025105-44028184 2.263553e-229 -0.8723221 0.056 0.657 2.021511e-224
# 差異分析結(jié)果可視化
plot1 <- VlnPlot(
object = pbmc,
features = rownames(da_peaks)[1],
pt.size = 0.1,
idents = c("CD4 Memory","CD14 Mono")
)
plot2 <- FeaturePlot(
object = pbmc,
features = rownames(da_peaks)[1],
pt.size = 0.1
)
plot1 | plot2
我們還可以使用ClosestFeature
函數(shù)提供EnsDb中基因的注釋信息包券,找到與每個peak最接近的基因。如果我們查看基因列表炫贤,會發(fā)現(xiàn)naive T細(xì)胞中的開放峰與BCL11B和GATA3(T細(xì)胞分化的關(guān)鍵調(diào)控因子)等基因接近溅固,而單核細(xì)胞中開放的峰與CEBPB(單核細(xì)胞分化的關(guān)鍵調(diào)控因子)基因接近。我們還可以對ClosestFeature返回的基因集進(jìn)行GO富集分析兰珍。
# 提取顯著的差異可及性peak區(qū)域信息
open_cd4naive <- rownames(da_peaks[da_peaks$avg_logFC > 0.5, ])
open_cd14mono <- rownames(da_peaks[da_peaks$avg_logFC < -0.5, ])
# 使用ClosestFeature函數(shù)對差異開放性peak區(qū)域進(jìn)行注釋
closest_genes_cd4naive <- ClosestFeature(
regions = open_cd4naive,
annotation = gene.ranges,
sep = c(':', '-')
)
closest_genes_cd14mono <- ClosestFeature(
regions = open_cd14mono,
annotation = gene.ranges,
sep = c(':', '-')
)
# 查看注釋后的結(jié)果
head(closest_genes_cd4naive)
## gene_id gene_name gene_biotype seq_coord_system
## ENSG00000127152 ENSG00000127152 BCL11B protein_coding chromosome
## ENSG00000204983 ENSG00000204983 PRSS1 protein_coding chromosome
## ENSG00000176155 ENSG00000176155 CCDC57 protein_coding chromosome
## ENSG00000127152.1 ENSG00000127152 BCL11B protein_coding chromosome
## ENSG00000134709 ENSG00000134709 HOOK1 protein_coding chromosome
## ENSG00000134954 ENSG00000134954 ETS1 protein_coding chromosome
## symbol entrezid closest_region
## ENSG00000127152 BCL11B 64919 chr14:99635624-99737861
## ENSG00000204983 PRSS1 5645, 5644 chr7:142457319-142460923
## ENSG00000176155 CCDC57 284001 chr17:80059336-80170706
## ENSG00000127152.1 BCL11B 64919 chr14:99635624-99737861
## ENSG00000134709 HOOK1 51361 chr1:60280458-60342050
## ENSG00000134954 ETS1 2113 chr11:128328656-128457453
## query_region distance
## ENSG00000127152 chr14:99721608-99741934 0
## ENSG00000204983 chr7:142501666-142511108 40742
## ENSG00000176155 chr17:80084198-80086094 0
## ENSG00000127152.1 chr14:99695477-99720910 0
## ENSG00000134709 chr1:60279767-60281364 0
## ENSG00000134954 chr11:128334097-128348572 0
head(closest_genes_cd14mono)
## gene_id gene_name gene_biotype seq_coord_system
## ENSG00000125538 ENSG00000125538 IL1B protein_coding chromosome
## ENSG00000181577 ENSG00000181577 C6orf223 protein_coding chromosome
## ENSG00000172216 ENSG00000172216 CEBPB protein_coding chromosome
## ENSG00000138621 ENSG00000138621 PPCDC protein_coding chromosome
## ENSG00000186350 ENSG00000186350 RXRA protein_coding chromosome
## ENSG00000273269 ENSG00000273269 RP11-761B3.1 protein_coding chromosome
## symbol entrezid closest_region
## ENSG00000125538 IL1B 3553 chr2:113587328-113594480
## ENSG00000181577 C6orf223 221416 chr6:43968317-43973695
## ENSG00000172216 CEBPB 1051 chr20:48807376-48809212
## ENSG00000138621 PPCDC 60490 chr15:75315896-75409803
## ENSG00000186350 RXRA 6256 chr9:137208944-137332431
## ENSG00000273269 RP11-761B3.1 NA chr2:47293080-47403650
## query_region distance
## ENSG00000125538 chr2:113581628-113594911 0
## ENSG00000181577 chr6:44025105-44028184 51409
## ENSG00000172216 chr20:48889794-48893313 80581
## ENSG00000138621 chr15:75334903-75336779 0
## ENSG00000186350 chr9:137263243-137268534 0
## ENSG00000273269 chr2:47297968-47301173 0
我們還可以使用CoveragePlot
函數(shù)對按聚類分組的任何基因組區(qū)域繪制覆蓋圖進(jìn)行可視化展示侍郭。These represent ‘pseudo-bulk’ accessibility tracks, where all cells within a cluster have been averaged together, in order to visualize a more robust chromatin landscape (thanks to Andrew Hill for giving the inspiration for this function in his excellent blog post).
# set plotting order
levels(pbmc) <- c("CD4 Naive","CD4 Memory","CD8 Naive","CD8 Effector","DN T","NK CD56bright","NK CD56Dim","pre-B",'pro-B',"pDC","DC","CD14 Mono",'CD16 Mono')
# 使用CoveragePlot函數(shù)繪制peaks的覆蓋圖
CoveragePlot(
object = pbmc,
region = rownames(da_peaks)[c(1,5)],
sep = c(":", "-"),
peaks = StringToGRanges(rownames(pbmc), sep = c(":", "-")),
annotation = gene.ranges,
extend.upstream = 20000,
extend.downstream = 20000,
ncol = 1
)
Working with datasets that were not quantified using CellRanger
10x Genomics的CellRanger軟件可以為每個細(xì)胞生成幾個有用的QC指標(biāo),以及峰/細(xì)胞矩陣和索引片段文件。在上面的示例中亮元,我們直接使用了CellRanger的輸出結(jié)果猛计。但是,對于不是其產(chǎn)生的結(jié)果苹粟,Signac中也提供了許多類似的替代功能有滑。
Generating a peak/cell or bin/cell matrix
我們可以使用FeatureMatrix
函數(shù)生成一個計(jì)數(shù)矩陣。
# not run
# peak_ranges should be a set of genomic ranges spanning the set of peaks to be quantified per cell
peak_matrix <- FeatureMatrix(
fragments = fragment.path,
features = peak_ranges
)
為了方便起見嵌削,我們還提供了GenomeBinMatrix
函數(shù)毛好,該函數(shù)將生成一個跨越整個基因組的一組基因組范圍,并在內(nèi)部運(yùn)行FeatureMatrix
以生成基因組bin/細(xì)胞矩陣
# not run
bin_matrix <- GenomeBinMatrix(
fragments = fragment.path,
genome = 'hg19',
binsize = 5000
)
Counting fraction of reads in peaks
對于一個給定的峰/細(xì)胞或bin/細(xì)胞矩陣苛秕,我們可以使用FRiP
函數(shù)計(jì)算每個細(xì)胞中peaks比對到的reads的比例肌访。
# not run
pbmc <- FRiP(
object = pbmc,
peak.assay = 'peaks',
bin.assay = 'bins'
)
Counting fragments in genome blacklist regions
我們知道,基因組blacklist區(qū)域內(nèi)比對到的reads的比率艇劫,可以作為低質(zhì)量細(xì)胞預(yù)測的指標(biāo)吼驶。Signac包中提供了幾個參考基因組(hg19,hg38店煞,mm9蟹演,mm10,ce10顷蟀,ce11酒请,dm3,dm6)的blacklist區(qū)域坐標(biāo)鸣个,這些區(qū)域信息由ENCODE聯(lián)盟提供羞反。我們可以使用FractionCountsInRegion
函數(shù)計(jì)算每個細(xì)胞在給定區(qū)域集中的比對到的reads的分?jǐn)?shù)。
# not run
pbmc$blacklist_fraction <- FractionCountsInRegion(
object = pbmc,
assay = 'peaks',
regions = blacklist_hg19
)
參考來源:https://satijalab.org/signac/articles/pbmc_vignette.html