前面也說(shuō)過(guò)用CCA,SCTransform和Harmony等可以消除多個(gè)數(shù)據(jù)的批次效應(yīng)荸型,來(lái)實(shí)現(xiàn)多組數(shù)據(jù)的整合盹靴。但有時(shí)候我們會(huì)遇到兩個(gè)datasets只有部分重疊,這和之前介紹的方法就有一點(diǎn)不同了。特別是稿静,在標(biāo)準(zhǔn)工作流下梭冠,識(shí)別存在于多個(gè)數(shù)據(jù)集中的基因可能存在問題。Seurat v4 包括一組方法自赔,以匹配(或"對(duì)齊")跨數(shù)據(jù)集共同的基因妈嘹。這些方法首先識(shí)別處于匹配生物狀態(tài)的交叉數(shù)據(jù)集對(duì)("錨點(diǎn)")柳琢,既可用于糾正數(shù)據(jù)集之間的技術(shù)差異(即批量效應(yīng)校正)绍妨,又可用于對(duì)整個(gè)實(shí)驗(yàn)條件進(jìn)行比較scRNA-seq分析,當(dāng)然也可以實(shí)現(xiàn)跨物種的整合柬脸,例如我們利用跨物種的同源基因他去。
原理
Seurat整合數(shù)據(jù)集的過(guò)程首先是對(duì)不同的數(shù)據(jù)集進(jìn)行降維,在低維空間中尋找成對(duì)的anchors倒堕,然后再以anchor作為錨點(diǎn)灾测,對(duì)其他的點(diǎn)的坐標(biāo)進(jìn)行調(diào)整,從而將數(shù)據(jù)集整合在一起垦巴。
Integration
官方教程中媳搪,在進(jìn)行數(shù)據(jù)整合之前,先將所有的數(shù)據(jù)整合在一起骤宣,再split成一個(gè)object list秦爆,在object list中又進(jìn)行了標(biāo)準(zhǔn)化等操作。真正進(jìn)行整合的核心代碼其實(shí)是FindIntegationAnchors以及IntegrateData兩個(gè)函數(shù)來(lái)實(shí)現(xiàn)的憔披。
下面等限,我們以在ctrl或干擾素刺激狀態(tài)下對(duì)人體免疫細(xì)胞 (PBMC) 進(jìn)行比較分析。
library(Seurat)
library(SeuratData)
library(patchwork)
install.packages("ifnb.SeuratData")? #我是直接下載下來(lái)自己安裝的測(cè)試數(shù)據(jù)芬膝,也可以通過(guò)命令安裝:InstallData("ifnb")
library(ifnb.SeuratData)
LoadData("ifnb")
> ifnb
An object of class Seurat
14053 features across 13999 samples within 1 assay
Active assay: RNA (14053 features, 0 variable features)
#把數(shù)據(jù)集分成2個(gè) (stim和CTRL)
ifnb.list <- SplitObject(ifnb, split.by = "stim")
#對(duì)每個(gè)數(shù)據(jù)集進(jìn)行標(biāo)準(zhǔn)化望门,提取細(xì)胞間變異系數(shù)較大的top 2000個(gè)基因
ifnb.list <- lapply(X = ifnb.list, FUN = function(x) {
? ? x <- NormalizeData(x)
? ? x <- FindVariableFeatures(x, selection.method = "vst", nfeatures = 2000)
})
# select features that are repeatedly variable across datasets for integration,也就是選擇兩個(gè)數(shù)據(jù)集之間高變異基因的交集
features <- SelectIntegrationFeatures(object.list = ifnb.list)
執(zhí)行整合
然后锰霜,我們使用該功能識(shí)別錨點(diǎn)筹误,以 Seurat 對(duì)象列表作為輸入,并使用這些錨點(diǎn)將兩個(gè)數(shù)據(jù)集集成在一起癣缅。其實(shí)纫事,這個(gè)工作真正進(jìn)行整合的核心代碼是FindIntegationAnchors以及IntegrateData兩個(gè)函數(shù)來(lái)實(shí)現(xiàn)的。
immune.anchors <- FindIntegrationAnchors(object.list = ifnb.list, anchor.features = features)
immune.combined <- IntegrateData(anchorset = immune.anchors)
FindIntegrationAnchors:
這個(gè)函數(shù)主要完成以下幾個(gè)步驟:
1. 降維
對(duì)list中的數(shù)據(jù)集分別進(jìn)行降維所灸,以便下一步在低維空間中尋找細(xì)胞的nearest neighbors丽惶。這一步可以選擇不同的方法,包括cca(canonical correlation analysis) rpca(reciprocal pca) rlsi三種爬立,可以用reduction進(jìn)行設(shè)置钾唬,默認(rèn)方法為cca。
cca是針對(duì)兩個(gè)不同數(shù)據(jù)集,尋找兩個(gè)數(shù)據(jù)集之間相關(guān)性最大的成分的分析方法抡秆。
2. 尋找anchors
在低維空間中奕巍,對(duì)于一個(gè)數(shù)據(jù)集中的每個(gè)細(xì)胞,尋找它在另一個(gè)數(shù)據(jù)集中距離最近的k個(gè)細(xì)胞(k nearest neighbors)儒士。如果X數(shù)據(jù)集中的細(xì)胞a在Y數(shù)據(jù)集中找到了b1 b2 … bk這幾個(gè)kNN的止,而Y數(shù)據(jù)集中的b1細(xì)胞在X數(shù)據(jù)集中找到的kNN又恰巧包括a,那么a和b1就是一對(duì)nearest neighbor着撩,又稱為mutual nearest neighbor(MNN)诅福。在這一步中,可以設(shè)置的參數(shù)為k.anchor拖叙,就是在相對(duì)數(shù)據(jù)集中尋找到的neighbor的個(gè)數(shù)氓润,默認(rèn)值為5.
3. 過(guò)濾anchors
這一步將在未降維的原始數(shù)據(jù)集中,提取與CCV(canonical correlation vectors)相關(guān)性最高的n個(gè)features薯鳍。對(duì)其進(jìn)行normalization咖气,然后在這些feaure構(gòu)成的空間中,對(duì)上一步找到的nearest neighbor pairs進(jìn)行篩選挖滤。比如崩溪,若上一步找到的a和b1這個(gè)pair,在這個(gè)高位空間中斩松,a的前k個(gè)nearest neighbor中不包含b1伶唯,那么這個(gè)nearest neighbor pair將會(huì)被過(guò)濾掉,不會(huì)再被使用砸民。在這一步中抵怎,可以自行設(shè)置的參數(shù)包括max.features,即選取多少個(gè)最相關(guān)基因構(gòu)成高維空間岭参,默認(rèn)為200反惕;k.filter,即在相對(duì)數(shù)據(jù)集中尋找nearest neighbor的個(gè)數(shù)演侯,默認(rèn)為200姿染。需要注意的是,這一步尋找NN的范圍設(shè)置較寬秒际,因?yàn)椴皇菫榱藢ふ倚碌膒air悬赏,而是為了過(guò)濾掉在上一步中找到的不可靠的Pair。
4. 對(duì)anchors進(jìn)行打分
如果X數(shù)據(jù)集的細(xì)胞a與Y數(shù)據(jù)集中的細(xì)胞b相似娄徊,那么細(xì)胞a應(yīng)該也與Y中與b相似的那些細(xì)胞相似闽颇,細(xì)胞b應(yīng)該也與X中與a相似的那些細(xì)胞相似〖娜瘢基于這種原理對(duì)anchor進(jìn)行打分兵多。對(duì)于一個(gè)X Y數(shù)據(jù)及中的一對(duì)anchor pair a和b尖啡,在X數(shù)據(jù)及中找到與a最相似的k個(gè)細(xì)胞,同時(shí)在Y數(shù)據(jù)集中也找到與其最相似的k個(gè)細(xì)胞剩膘,同時(shí)也對(duì)b做相同的事情衅斩。然后計(jì)算他們又多少neighbor是共享的,即shared nearest neighbors(SNN)怠褐。根據(jù)SNN的比例對(duì)這個(gè)anchor pair進(jìn)行打分畏梆。SNN越多,打分越高奈懒,意味著這個(gè)paire越可靠奠涌,在后續(xù)的整合中這個(gè)pair就會(huì)有更高的權(quán)重。涉及這一步的參數(shù)是k.weight筐赔,即在不同數(shù)據(jù)集中尋找的neighbor的個(gè)數(shù)铣猩,默認(rèn)為30揖铜。
經(jīng)過(guò)這幾步茴丰,F(xiàn)indIntegrationAnchors就會(huì)找到兩個(gè)數(shù)據(jù)集之間的較為可靠的anchors,并且對(duì)可靠性有一個(gè)評(píng)估天吓。
IntegrateData:
這個(gè)函數(shù)主要完成對(duì)于每個(gè)細(xì)胞校準(zhǔn)贿肩。
對(duì)于每個(gè)細(xì)胞,找到與其距離最近的k個(gè)anchors龄寞,然后根據(jù)細(xì)胞與anchor的距離汰规,以及上一步對(duì)anchor可靠性的打分,決定每個(gè)anchors對(duì)這個(gè)細(xì)胞的權(quán)重物邑。k值由k.weight設(shè)置溜哮,默認(rèn)值為100。然后利用anchors的坐標(biāo)以及不同anchors所占的權(quán)重色解,對(duì)每個(gè)細(xì)胞的表達(dá)數(shù)據(jù)進(jìn)行校準(zhǔn)茂嗓。
IntegrateData返回的是與SCTransform或NormalizeData相同格式的數(shù)據(jù)slot。如果normalization.method設(shè)置為L(zhǎng)ogNormalize科阎,則返回?cái)?shù)據(jù)為normalized slot述吸,儲(chǔ)存在@data中,如果設(shè)置為SCT锣笨,則返回?cái)?shù)據(jù)為完成中心化的數(shù)據(jù)蝌矛,儲(chǔ)存在@scale.data中错英。
執(zhí)行整合分析:
下面入撒,我們可以對(duì)所有細(xì)胞進(jìn)行單次整合分析,像以前的seurat一樣噪伊。
# unmodified data still resides in the 'RNA' assay
DefaultAssay(immune.combined) <- "integrated"
immune.combined <- ScaleData(immune.combined, verbose = FALSE)
immune.combined <- RunPCA(immune.combined, npcs = 30, verbose = FALSE)
immune.combined <- RunUMAP(immune.combined, reduction = "pca", dims = 1:30)
immune.combined <- FindNeighbors(immune.combined, reduction = "pca", dims = 1:30)
immune.combined <- FindClusters(immune.combined, resolution = 0.5)
p1 <- DimPlot(immune.combined, reduction = "umap", group.by = "stim")
p2 <- DimPlot(immune.combined, reduction = "umap", label = TRUE, repel = TRUE)
p1 + p2
DimPlot(immune.combined, reduction = "umap", split.by = "stim")
鑒定在不同的組或數(shù)據(jù)集里面保守的細(xì)胞標(biāo)記基因:
在以前的seurat分析中氮唯,我們是利用FindMarkers等來(lái)鑒定不同cluster之間的細(xì)胞標(biāo)記基因鉴吹,現(xiàn)在我們可以利用FindConservedMarkers鑒定特定cluster在不同組之間都存在的,也就是保守的marker基因惩琉。例如:我們可以計(jì)算出無(wú)論刺激條件如何豆励,第6組(NK細(xì)胞)中保守標(biāo)記的基因。
DefaultAssay(immune.combined) <- "RNA"
nk.markers <- FindConservedMarkers(immune.combined, ident.1 = 6, grouping.var = "stim", verbose = FALSE)
head(nk.markers)
從結(jié)果可以看出瞒渠,前半部分和普通的seurat的分析類似良蒸,只不過(guò)是分了2組而已,在2個(gè)組中都是marker基因伍玖。
畫基因的表達(dá)圖嫩痰,也和前面講seurat類似:
FeaturePlot(immune.combined, features = c("CD3D", "SELL", "CREM", "CD8A", "GNLY", "CD79A", "FCGR3A", "CCL2", "PPBP"), min.cutoff = "q9")
immune.combined <- RenameIdents(immune.combined, '0' = "CD14 Mono", '1' = "CD4 Naive T", '2' = "CD4 Memory T",'3' = "CD16 Mono", '4' = "B", '5' = "CD8 T", '6' = "NK", '7' = "T activated", '8' = "DC", '9' = "B Activated",'10' = "Mk", '11' = "pDC", '12' = "Eryth", '13' = "Mono/Mk Doublets", '14' = "HSPC")
DimPlot(immune.combined, label = TRUE)
Idents(immune.combined) <- factor(Idents(immune.combined), levels = c("HSPC", "Mono/Mk Doublets", "pDC", "Eryth", "Mk", "DC", "CD14 Mono", "CD16 Mono", "B Activated", "B", "CD8 T", "NK", "T activated", "CD4 Naive T", "CD4 Memory T"))
markers.to.plot <- c("CD3D", "CREM", "HSPH1", "SELL", "GIMAP5", "CACYBP", "GNLY", "NKG7", "CCL5",
? ? "CD8A", "MS4A1", "CD79A", "MIR155HG", "NME1", "FCGR3A", "VMO1", "CCL2", "S100A9", "HLA-DQA1",
? ? "GPR183", "PPBP", "GNG11", "HBA2", "HBB", "TSPAN13", "IL3RA", "IGJ", "PRSS57")
DotPlot(immune.combined, features = markers.to.plot, cols = c("blue", "red"), dot.scale = 8, split.by = "stim") +
? ? RotatedAxis()
兩組數(shù)據(jù)對(duì)比,識(shí)別差異基因:
因?yàn)槲覀冇?組數(shù)據(jù):STIM和CTRL窍箍,也就是刺激和對(duì)照細(xì)胞串纺,我們可以開始做比較分析,看看刺激引起的差異椰棘。觀察這些變化的一種方法是繪制受刺激細(xì)胞和對(duì)照細(xì)胞的平均表達(dá)纺棺,并尋找在散點(diǎn)圖上離群的基因。在這里邪狞,我們采取刺激和對(duì)照組的幼稚T細(xì)胞和CD14+單核細(xì)胞群的平均表達(dá)祷蝌,并產(chǎn)生散點(diǎn)圖,突出顯示干擾素刺激劇烈反應(yīng)的基因帆卓。
library(ggplot2)
library(cowplot)
theme_set(theme_cowplot())
t.cells <- subset(immune.combined, idents = "CD4 Naive T")
Idents(t.cells) <- "stim"
avg.t.cells <- as.data.frame(log1p(AverageExpression(t.cells, verbose = FALSE)$RNA))
avg.t.cells$gene <- rownames(avg.t.cells)
cd14.mono <- subset(immune.combined, idents = "CD14 Mono")
Idents(cd14.mono) <- "stim"
avg.cd14.mono <- as.data.frame(log1p(AverageExpression(cd14.mono, verbose = FALSE)$RNA))
avg.cd14.mono$gene <- rownames(avg.cd14.mono)
genes.to.label = c("ISG15", "LY6E", "IFI6", "ISG20", "MX1", "IFIT2", "IFIT1", "CXCL10", "CCL8")
p1 <- ggplot(avg.t.cells, aes(CTRL, STIM)) + geom_point() + ggtitle("CD4 Naive T Cells")
p1 <- LabelPoints(plot = p1, points = genes.to.label, repel = TRUE)
p2 <- ggplot(avg.cd14.mono, aes(CTRL, STIM)) + geom_point() + ggtitle("CD14 Monocytes")
p2 <- LabelPoints(plot = p2, points = genes.to.label, repel = TRUE)
p1 + p2
這樣巨朦,我們可以清晰的發(fā)現(xiàn)那些基因在細(xì)胞水平有明顯的差異。
我們當(dāng)然剑令,也可以挑選相同細(xì)胞類型在不同組中顯著差異的細(xì)胞糊啡。還是利用Seurat中常用的FindMarkers函數(shù)。只不過(guò)要把細(xì)胞類型重新命名尚洽,加上組別的信息悔橄。
immune.combined$celltype.stim <- paste(Idents(immune.combined), immune.combined$stim, sep = "_")
immune.combined$celltype <- Idents(immune.combined)
Idents(immune.combined) <- "celltype.stim"
b.interferon.response <- FindMarkers(immune.combined, ident.1 = "B_STIM", ident.2 = "B_CTRL", verbose = FALSE)
head(b.interferon.response, n = 15)
我們也可以對(duì)于給定基因列表的特征圖,按分組變量(此處為刺激對(duì)照)進(jìn)行拆分腺毫。
FeaturePlot(immune.combined, features = c("CD3D", "GNLY", "IFI6"), split.by = "stim", max.cutoff = 3,cols = c("grey", "red"))
plots <- VlnPlot(immune.combined, features = c("LYZ", "ISG15", "CXCL10"), split.by = "stim", group.by = "celltype",pt.size = 0, combine = FALSE)
wrap_plots(plots = plots, ncol = 1)
當(dāng)然癣疟,網(wǎng)上還建議了另外一個(gè)改進(jìn)版的整合功能,是使用sctransform來(lái)整合數(shù)據(jù)
有幾個(gè)關(guān)鍵差異:
1. 通過(guò)SCTransform()單獨(dú)(而不是在整合之前)實(shí)現(xiàn)數(shù)據(jù)集標(biāo)準(zhǔn)化潮酒;
2. 在識(shí)別錨點(diǎn)之前運(yùn)行:[PrepSCTIntegration()]
3. 運(yùn)行[FindIntegrationAnchors()]和[IntegrateData()]時(shí)睛挚,設(shè)置參數(shù)到normalization.method為SCT
4. 當(dāng)運(yùn)行基于 sctransform 的工作流(包括整合)時(shí),不運(yùn)行該功能[ScaleData()]
大致流程如下:(但是具體那種好急黎,我現(xiàn)在也不是很清楚)
LoadData("ifnb")
ifnb.list <- SplitObject(ifnb, split.by = "stim")
ifnb.list <- lapply(X = ifnb.list, FUN = SCTransform)
features <- SelectIntegrationFeatures(object.list = ifnb.list, nfeatures = 3000)
ifnb.list <- PrepSCTIntegration(object.list = ifnb.list, anchor.features = features)
immune.anchors <- FindIntegrationAnchors(object.list = ifnb.list, normalization.method = "SCT",
? ? anchor.features = features)
immune.combined.sct <- IntegrateData(anchorset = immune.anchors, normalization.method = "SCT")
immune.combined.sct <- RunPCA(immune.combined.sct, verbose = FALSE)
immune.combined.sct <- RunUMAP(immune.combined.sct, reduction = "pca", dims = 1:30)
p1 <- DimPlot(immune.combined.sct, reduction = "umap", group.by = "stim")
p2 <- DimPlot(immune.combined.sct, reduction = "umap", group.by = "seurat_annotations", label = TRUE,repel = TRUE)
p1 + p2
我們后面會(huì)用自己的數(shù)據(jù)集試一下看看扎狱。