Seurat包學(xué)習(xí)筆記（六）：scATAC-seq + scRNA-seq integration

在本教程中，我們將學(xué)習(xí)使用Seurat3對(duì)scATAC-seq和scRNA-seq的數(shù)據(jù)進(jìn)行整合分析格嗅，使用一種新的數(shù)據(jù)轉(zhuǎn)移映射方法，將scATAC-seq的數(shù)據(jù)基于scRNA-seq數(shù)據(jù)聚類的結(jié)果進(jìn)行細(xì)胞分群汉匙，并進(jìn)行整合分析涤姊。
對(duì)于scATAC-seq，我們還開(kāi)發(fā)了一個(gè)獨(dú)立的Signac包霜定，可以用于單細(xì)胞ATAC-seq的數(shù)據(jù)分析及其整合分析档悠，詳細(xì)的使用方法可以參考Signac包官網(wǎng)的說(shuō)明文檔。

image

本示例中望浩，我們將使用10X genomics測(cè)序平臺(tái)產(chǎn)生的PBMC數(shù)據(jù)集（包含大約10K個(gè)細(xì)胞的scRNA-seq和scATAC-seq的測(cè)序數(shù)據(jù)）進(jìn)行分析演示辖所，整體的分析流程主要包括以下幾個(gè)步驟：

Estimate RNA-seq levels from ATAC-seq (quantify gene expression ‘a(chǎn)ctivity’ from ATAC-seq reads)
Learn the internal structure of the ATAC-seq data on its own (accomplished using LSI)
Identify ‘a(chǎn)nchors’ between the ATAC-seq and RNA-seq datasets
Transfer data between datasets (either transfer labels for classification, or impute RNA levels in the ATAC-seq data to enable co-embedding)

基因表達(dá)活性的定量

首先，我們加載scATAC-seq數(shù)據(jù)產(chǎn)生的一個(gè)peak matrix磨德，并將其整合合并成一個(gè)gene activity matrix缘回。我們基于這樣一個(gè)簡(jiǎn)單的假設(shè)：基因的表達(dá)活性可以簡(jiǎn)單的通過(guò)基因上下游2kb范圍內(nèi)覆蓋的reads數(shù)的加和進(jìn)行定量，最后獲得一個(gè)gene x cell的表達(dá)矩陣典挑。

# 加載所需的R包
library(Seurat)
library(ggplot2)
library(patchwork)
# 加載peak矩陣文件
peaks <- Read10X_h5("../data/atac_v1_pbmc_10k_filtered_peak_bc_matrix.h5")

# create a gene activity matrix from the peak matrix and GTF, using chromosomes 1:22, X, and Y.
# Peaks that fall within gene bodies, or 2kb upstream of a gene, are considered
# 使用CreateGeneActivityMatrix函數(shù)構(gòu)建基因表達(dá)活性矩陣
activity.matrix <- CreateGeneActivityMatrix(peak.matrix = peaks, annotation.file = "../data/Homo_sapiens.GRCh37.82.gtf", seq.levels = c(1:22, "X", "Y"), upstream = 2000, verbose = TRUE)

構(gòu)建Seurat對(duì)象

接下來(lái)酥宴，我們創(chuàng)建一個(gè)seurat對(duì)象存儲(chǔ)scATAC-seq的數(shù)據(jù)，其中原始的peak counts存儲(chǔ)在"ATAC"這個(gè)Assay中您觉，基因表達(dá)活性的矩陣存儲(chǔ)在"RNA"這個(gè)Assay中拙寡。對(duì)于原始的count，我們將進(jìn)行QC過(guò)濾掉那些總counts數(shù)低于5k的細(xì)胞琳水。

# 創(chuàng)建seurat對(duì)象
pbmc.atac <- CreateSeuratObject(counts = peaks, assay = "ATAC", project = "10x_ATAC")
pbmc.atac[["ACTIVITY"]] <- CreateAssayObject(counts = activity.matrix)
# meta信息
meta <- read.table("../data/atac_v1_pbmc_10k_singlecell.csv", sep = ",", header = TRUE, row.names = 1, stringsAsFactors = FALSE)
meta <- meta[colnames(pbmc.atac), ]
pbmc.atac <- AddMetaData(pbmc.atac, metadata = meta)
# 數(shù)據(jù)過(guò)濾
pbmc.atac <- subset(pbmc.atac, subset = nCount_ATAC > 5000)
pbmc.atac$tech <- "atac"

數(shù)據(jù)預(yù)處理

# 對(duì)基因表達(dá)活性矩陣進(jìn)行標(biāo)準(zhǔn)化
DefaultAssay(pbmc.atac) <- "ACTIVITY"
pbmc.atac <- FindVariableFeatures(pbmc.atac)
pbmc.atac <- NormalizeData(pbmc.atac)
pbmc.atac <- ScaleData(pbmc.atac)

Here we perform latent semantic indexing to reduce the dimensionality of the scATAC-seq data. This procedure learns an ‘internal’ structure for the scRNA-seq data, and is important when determining the appropriate weights for the anchors when transferring information. We utilize Latent Semantic Indexing (LSI) to learn the structure of ATAC-seq data, as proposed in Cusanovich et al, Science 2015 and implemented in the RunLSI function. LSI is implemented here by performing computing the term frequency-inverse document frequency (TF-IDF) followed by SVD.

# 對(duì)原始的peak矩陣進(jìn)行降維處理
DefaultAssay(pbmc.atac) <- "ATAC"
VariableFeatures(pbmc.atac) <- names(which(Matrix::rowSums(pbmc.atac) > 100))
# 使用LSI方法進(jìn)行降維
pbmc.atac <- RunLSI(pbmc.atac, n = 50, scale.max = NULL)
pbmc.atac <- RunUMAP(pbmc.atac, reduction = "lsi", dims = 1:50)

# 加載scRNA-seq數(shù)據(jù)
pbmc.rna <- readRDS("../data/pbmc_10k_v3.rds")
pbmc.rna$tech <- "rna"

# 數(shù)據(jù)可視化
p1 <- DimPlot(pbmc.atac, reduction = "umap") + NoLegend() + ggtitle("scATAC-seq")
p2 <- DimPlot(pbmc.rna, group.by = "celltype", label = TRUE, repel = TRUE) + NoLegend() + ggtitle("scRNA-seq")
p1 + p2

image

接下來(lái)肆糕，我們將對(duì)scATAC-seq和scRNA-seq的數(shù)據(jù)進(jìn)行整合分析般堆。

# 識(shí)別整合的anchors，使用scRNA-seq的數(shù)據(jù)作為參照
transfer.anchors <- FindTransferAnchors(reference = pbmc.rna, query = pbmc.atac, features = VariableFeatures(object = pbmc.rna), reference.assay = "RNA", query.assay = "ACTIVITY", reduction = "cca")
# 使用TransferData函數(shù)進(jìn)行數(shù)據(jù)轉(zhuǎn)移映射
celltype.predictions <- TransferData(anchorset = transfer.anchors, refdata = pbmc.rna$celltype, weight.reduction = pbmc.atac[["lsi"]])
# 添加預(yù)測(cè)信息
pbmc.atac <- AddMetaData(pbmc.atac, metadata = celltype.predictions)

然后擎宝，我們可以檢查預(yù)測(cè)分?jǐn)?shù)的分布郁妈，并可以選擇過(guò)濾掉那些得分較低的細(xì)胞。在這里绍申，我們發(fā)現(xiàn)超過(guò)95％的細(xì)胞得分為0.5或更高噩咪。

hist(pbmc.atac$prediction.score.max)
abline(v = 0.5, col = "red")
table(pbmc.atac$prediction.score.max > 0.5)
## 
## FALSE  TRUE 
##   290  7576

image

同樣的，我們可以通過(guò)UMAP降維可視化scATAC-seq數(shù)據(jù)預(yù)測(cè)的細(xì)胞類型极阅，并發(fā)現(xiàn)所轉(zhuǎn)移的標(biāo)簽與scRNA-seq數(shù)據(jù)UMAP結(jié)果的結(jié)構(gòu)高度一致胃碾。

# 數(shù)據(jù)過(guò)濾
pbmc.atac.filtered <- subset(pbmc.atac, subset = prediction.score.max > 0.5)
pbmc.atac.filtered$predicted.id <- factor(pbmc.atac.filtered$predicted.id, levels = levels(pbmc.rna))  # to make the colors match
# 數(shù)據(jù)可視化
p1 <- DimPlot(pbmc.atac.filtered, group.by = "predicted.id", label = TRUE, repel = TRUE) + ggtitle("scATAC-seq cells") + NoLegend() + scale_colour_hue(drop = FALSE)
p2 <- DimPlot(pbmc.rna, group.by = "celltype", label = TRUE, repel = TRUE) + ggtitle("scRNA-seq cells") + NoLegend()
p1 + p2

image

共嵌入聯(lián)合可視化分析

最后，為了將所有的細(xì)胞一起可視化筋搏，我們可以將scRNA-seq和scATAC-seq細(xì)胞共同嵌入到同一低維空間中仆百。在這里，我們首先使用先前識(shí)別出的相同anchors來(lái)轉(zhuǎn)移映射細(xì)胞類型標(biāo)簽奔脐，以估算填充scATAC-seq細(xì)胞的RNA-seq值俄周。然后，我們合并測(cè)得的和估算填充的scRNA-seq數(shù)據(jù)髓迎，并運(yùn)行標(biāo)準(zhǔn)UMAP進(jìn)行降維可視化所有的細(xì)胞峦朗。請(qǐng)注意，此步驟僅用于可視化目的排龄，不是數(shù)據(jù)轉(zhuǎn)移映射分析的必要部分波势。

# note that we restrict the imputation to variable genes from scRNA-seq, but could impute the full transcriptome if we wanted to
genes.use <- VariableFeatures(pbmc.rna)
refdata <- GetAssayData(pbmc.rna, assay = "RNA", slot = "data")[genes.use, ]

# refdata (input) contains a scRNA-seq expression matrix for the scRNA-seq cells.  imputation (output) will contain an imputed scRNA-seq matrix for each of the ATAC cells
imputation <- TransferData(anchorset = transfer.anchors, refdata = refdata, weight.reduction = pbmc.atac[["lsi"]])

# this line adds the imputed data matrix to the pbmc.atac object
pbmc.atac[["RNA"]] <- imputation
# 進(jìn)行merge合并
coembed <- merge(x = pbmc.rna, y = pbmc.atac)

# Finally, we run PCA and UMAP on this combined object, to visualize the co-embedding of both datasets
coembed <- ScaleData(coembed, features = genes.use, do.scale = FALSE)
coembed <- RunPCA(coembed, features = genes.use, verbose = FALSE)
coembed <- RunUMAP(coembed, dims = 1:30)
coembed$celltype <- ifelse(!is.na(coembed$celltype), coembed$celltype, coembed$predicted.id)

Here we plot all cells colored by either their assigned cell type (from the 10K dataset) or their predicted cell type from the data transfer procedure.

p1 <- DimPlot(coembed, group.by = "tech")
p2 <- DimPlot(coembed, group.by = "celltype", label = TRUE, repel = TRUE)
p1 + p2

image

How can I further investigate unmatched groups of cells that only appear in one assay?

在查看UMAP嵌和數(shù)據(jù)可視化的結(jié)果中，我們發(fā)現(xiàn)有幾種細(xì)胞似乎僅在一種測(cè)序類型數(shù)據(jù)中存在橄维。例如尺铣，血小板細(xì)胞僅出現(xiàn)在scRNA-seq數(shù)據(jù)中。這些細(xì)胞被認(rèn)為正在經(jīng)歷從巨核細(xì)胞到血小板細(xì)胞分化的過(guò)程中争舞，因此要么完全缺乏核物質(zhì)凛忿，要么它們的染色質(zhì)狀態(tài)與其轉(zhuǎn)錄的過(guò)程不一致。因此竞川，我們不能期望它們?cè)诖朔治鲋袝?huì)保持一致店溢。

DimPlot(coembed, split.by = "tech", group.by = "celltype", label = TRUE, repel = TRUE) + NoLegend()

image

此外，我們還發(fā)現(xiàn)在B細(xì)胞祖細(xì)胞旁邊似乎也有一個(gè)細(xì)胞類群流译，該類群只存在于scATAC-seq數(shù)據(jù)中逞怨，并且與scRNA-seq數(shù)據(jù)的細(xì)胞整合得不好者疤。對(duì)這些細(xì)胞的元數(shù)據(jù)信息進(jìn)一步檢查發(fā)現(xiàn)福澡，大量的reads映射到black list區(qū)域（由10x Genomics QC指標(biāo)提供）。這表明這些細(xì)胞可能代表了死亡或垂死的細(xì)胞驹马，環(huán)境中的DNA或其他人工的污染革砸，導(dǎo)致其不存在于scRNA-seq數(shù)據(jù)集中除秀。

coembed$blacklist_region_fragments[is.na(coembed$blacklist_region_fragments)] <- 0
FeaturePlot(coembed, features = "blacklist_region_fragments", max.cutoff = 500)

image

How can I evaluate the success of cross-modality integration?

Overall, the prediction scores are high which suggest a high degree of confidence in our cell type assignments.
We observe good agreement between the scATAC-seq only dimensional reduction (UMAP plot above) and the transferred labels.
The co-embedding based on the same set of anchors gives good mixing between the two modalities.
When we collapse the ATAC-seq reads on a per-cluster basis into “pseudo bulk” profiles, we observe good concordance with the chromatin patterns observed in bulk data

參考來(lái)源：https://satijalab.org/seurat/v3.1/atacseq_integration_vignette.html

最后編輯于：2020.09.19 18:42:44

?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請(qǐng)聯(lián)系作者

人面猴
序言：七十年代末，一起剝皮案震驚了整個(gè)濱河市算利，隨后出現(xiàn)的幾起案子册踩，更是在濱河造成了極大的恐慌，老刑警劉巖效拭，帶你破解...
沈念sama閱讀 216,324評(píng)論 6贊 498
死咒
序言：濱河連續(xù)發(fā)生了三起死亡事件暂吉，死亡現(xiàn)場(chǎng)離奇詭異，居然都是意外死亡缎患，警方通過(guò)查閱死者的電腦和手機(jī)慕的，發(fā)現(xiàn)死者居然都...
沈念sama閱讀 92,356評(píng)論 3贊 392
救了他兩次的神仙讓他今天三更去死
文/潘曉璐我一進(jìn)店門，熙熙樓的掌柜王于貴愁眉苦臉地迎上來(lái)挤渔，“玉大人肮街，你說(shuō)我怎么就攤上這事∨械迹” “怎么了嫉父？”我有些...
開(kāi)封第一講書(shū)人閱讀 162,328評(píng)論 0贊 353
道士緝兇錄：失蹤的賣姜人
文/不壞的土叔我叫張陵，是天一觀的道長(zhǎng)眼刃。經(jīng)常有香客問(wèn)我绕辖，道長(zhǎng)，這世上最難降的妖魔是什么鸟整？我笑而不...
開(kāi)封第一講書(shū)人閱讀 58,147評(píng)論 1贊 292
?港島之戀（遺憾婚禮）
正文為了忘掉前任引镊，我火速辦了婚禮，結(jié)果婚禮上篮条，老公的妹妹穿的比我還像新娘弟头。我一直安慰自己，他們只是感情好涉茧，可當(dāng)我...
茶點(diǎn)故事閱讀 67,160評(píng)論 6贊 388
惡毒庶女頂嫁案：這布局不是一般人想出來(lái)的
文/花漫我一把揭開(kāi)白布赴恨。她就那樣靜靜地躺著，像睡著了一般伴栓。火紅的嫁衣襯著肌膚如雪伦连。梳的紋絲不亂的頭發(fā)上，一...
開(kāi)封第一講書(shū)人閱讀 51,115評(píng)論 1贊 296
城市分裂傳說(shuō)
那天钳垮，我揣著相機(jī)與錄音惑淳，去河邊找鬼。笑死饺窿，一個(gè)胖子當(dāng)著我的面吹牛歧焦，可吹牛的內(nèi)容都是我干的。我是一名探鬼主播肚医，決...
沈念sama閱讀 40,025評(píng)論 3贊 417
雙鴛鴦連環(huán)套：你想象不到人心有多黑
文/蒼蘭香墨我猛地睜開(kāi)眼绢馍，長(zhǎng)吁一口氣：“原來(lái)是場(chǎng)噩夢(mèng)啊……” “哼向瓷！你這毒婦竟也來(lái)了？” 一聲冷哼從身側(cè)響起舰涌，我...
開(kāi)封第一講書(shū)人閱讀 38,867評(píng)論 0贊 274
萬(wàn)榮殺人案實(shí)錄
序言：老撾萬(wàn)榮一對(duì)情侶失蹤猖任，失蹤者是張志新（化名）和其女友劉穎，沒(méi)想到半個(gè)月后瓷耙，有當(dāng)?shù)厝嗽跇?shù)林里發(fā)現(xiàn)了一具尸體朱躺，經(jīng)...
沈念sama閱讀 45,307評(píng)論 1贊 310
?護(hù)林員之死
正文獨(dú)居荒郊野嶺守林人離奇死亡，尸身上長(zhǎng)有42處帶血的膿包…… 初始之章·張勛以下內(nèi)容為張勛視角年9月15日...
茶點(diǎn)故事閱讀 37,528評(píng)論 2贊 332
?白月光啟示錄
正文我和宋清朗相戀三年搁痛，在試婚紗的時(shí)候發(fā)現(xiàn)自己被綠了室琢。大學(xué)時(shí)的朋友給我發(fā)了我未婚夫和他白月光在一起吃飯的照片。...
茶點(diǎn)故事閱讀 39,688評(píng)論 1贊 348
活死人
序言：一個(gè)原本活蹦亂跳的男人離奇死亡落追，死狀恐怖盈滴，靈堂內(nèi)的尸體忽然破棺而出，到底是詐尸還是另有隱情轿钠，我是刑警寧澤巢钓，帶...
沈念sama閱讀 35,409評(píng)論 5贊 343
?日本核電站爆炸內(nèi)幕
正文年R本政府宣布，位于F島的核電站疗垛，受9級(jí)特大地震影響症汹，放射性物質(zhì)發(fā)生泄漏。R本人自食惡果不足惜贷腕，卻給世界環(huán)境...
茶點(diǎn)故事閱讀 41,001評(píng)論 3贊 325
男人毒藥：我在死后第九天來(lái)索命
文/蒙蒙一背镇、第九天我趴在偏房一處隱蔽的房頂上張望。院中可真熱鬧泽裳，春花似錦瞒斩、人聲如沸。這莊子的主人今日做“春日...
開(kāi)封第一講書(shū)人閱讀 31,657評(píng)論 0贊 22
一樁弒父案胸囱，背后竟有這般陰謀
文/蒼蘭香墨我抬頭看了看天上的太陽(yáng)。三九已至瀑梗，卻和暖如春烹笔，著一層夾襖步出監(jiān)牢的瞬間，已是汗流浹背抛丽。一陣腳步聲響...
開(kāi)封第一講書(shū)人閱讀 32,811評(píng)論 1贊 268
情欲美人皮
我被黑心中介騙來(lái)泰國(guó)打工谤职，沒(méi)想到剛下飛機(jī)就差點(diǎn)兒被人妖公主榨干…… 1. 我叫王不留，地道東北人亿鲜。一個(gè)月前我還...
沈念sama閱讀 47,685評(píng)論 2贊 368
代替公主和親
正文我出身青樓允蜈，卻偏偏與公主長(zhǎng)得像，于是被迫代替她去往敵國(guó)和親。傳聞我的和親對(duì)象是個(gè)殘疾皇子陷寝，可洞房花燭夜當(dāng)晚...
茶點(diǎn)故事閱讀 44,573評(píng)論 2贊 353

Seurat包學(xué)習(xí)筆記（六）：scATAC-seq + scRNA-seq integration

基因表達(dá)活性的定量

構(gòu)建Seurat對(duì)象

數(shù)據(jù)預(yù)處理

共嵌入聯(lián)合可視化分析

How can I further investigate unmatched groups of cells that only appear in one assay?

How can I evaluate the success of cross-modality integration?

推薦閱讀更多精彩內(nèi)容