隨著單細(xì)胞測序技術(shù)的成熟,越來越多的研究者選擇應(yīng)用該技術(shù)來闡釋手上的生物學(xué)問題。同時(shí)單細(xì)胞也不再是單樣本單物種單器官的技術(shù)胚嘲,往往會(huì)用到多樣本整合分析的技術(shù),這方面Seurat團(tuán)隊(duì)是最值得關(guān)注的洛二。他們提出了一套用于單細(xì)胞樣本整合分析的算法馋劈,Comprehensive integration of single cell data,并打包成Rpackages可以說是很貼心了晾嘶。
其整合單細(xì)胞數(shù)據(jù)的核心函數(shù)之一就是:FindIntegrationAnchors {Seurat}
如其名稱所指妓雾,是用來Finds the integration anchors的,他是如何做到的呢垒迂?
?FindIntegrationAnchors
但是在整合的時(shí)候有時(shí)候整合不成功械姻,特別是應(yīng)用之前的單細(xì)胞技術(shù),識(shí)別的細(xì)胞較少的時(shí)候娇斑,如可能會(huì)報(bào)錯(cuò):
Filtering Anchors
Error in nn2(data = cn.data2[nn.cells2, ], query = cn.data1[nn.cells1, :
Cannot find more nearest neighbours than there are points
github上有這個(gè)問題的討論:https://github.com/satijalab/seurat/issues/997
有用的回答是這樣的:
I experienced the same problem when integrating very heterogeneous datasets, and it seems to be related to the size of k.filter
parameter.In my case, lowering the default value of 200 to 150 did the job. However, I am still trying to figure out whether the results are meaningful, and the underlying biological rationale of it, so any explanation would be welcome!
那么k.filter
的影響到底是多大呢策添?我們來試一下。
我用Unravelling subclonal heterogeneity and aggressive disease states in TNBC through single-cell RNA-seq這篇文章的數(shù)據(jù)集:GSE118390.
library(R.utils)
library(tidyverse)
library(reticulate)
use_python("E:\\conda\\python.exe")
gunzip("D:\\Users\\Administrator\\Desktop\\RStudio\\con-cluster\\GSE118389_counts_rsem.txt.gz", remove=FALSE)
medata<-read.table("D:\\Users\\Administrator\\Desktop\\RStudio\\con-cluster\\GSE118389_counts_rsem.txt")
dim(medata)
table(str_split(colnames(medata),pattern="_",simplify=T)[,1])
PT039 PT058 PT081 PT084 PT089 PT126
341 96 288 286 333 190
Seurat基本流程:
library(Seurat)
pbmc <- CreateSeuratObject(counts = medata, project = "pbmc3k", min.cells = 3, min.features = 200)
pbmc <- NormalizeData(pbmc, normalization.method = "LogNormalize", scale.factor = 10000)
pbmc <- FindVariableFeatures(pbmc, selection.method = "vst", nfeatures = 2000)
pbmc <- ScaleData(pbmc,features=VariableFeatures(pbmc))
pbmc <- RunPCA(pbmc, features = VariableFeatures(object = pbmc))
pbmc <- FindNeighbors(pbmc, dims = 1:10)
pbmc <- FindClusters(pbmc, resolution = 0.5)
pbmc <- RunUMAP(pbmc, dims = 1:10)
pbmc <- RunTSNE(pbmc, dims = 1:10)
?DimPlot
table(pbmc@meta.data$orig.ident)
PT039 PT058 PT081 PT089
127 58 57 237
DimPlot(pbmc, reduction = "umap",label = TRUE,group.by="orig.ident")
這就是沒有整合的時(shí)候毫缆,四個(gè)樣本的降維結(jié)果唯竹。
下面我們用不同的k.filter
參數(shù)來試一下,
pancreas.list <- SplitObject(pbmc, split.by = "orig.ident")
pancreas.list <- pancreas.list[c("PT039", "PT058", "PT081", "PT089")]
for (i in 1:length(pancreas.list)) {
pancreas.list[[i]] <- NormalizeData(pancreas.list[[i]], verbose = FALSE)
pancreas.list[[i]] <- FindVariableFeatures(pancreas.list[[i]], selection.method = "vst",
nfeatures = 2000, verbose = FALSE)
}
#reference.list <- pancreas.list[c("PT081", "PT089")]
SelectIntegrationFeatures(object.list = pancreas.list)
#?FindIntegrationAnchors
pancreas.anchors <- FindIntegrationAnchors(object.list = pancreas.list, dims = 1:20,k.anchor = 5,k.filter = 30)
#?IntegrateData
pancreas.integrated <- IntegrateData(anchorset = pancreas.anchors, dims = 1:20)
Merging dataset 3 into 1
Extracting anchors for merged samples
Finding integration vectors
Finding integration vector weights
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
Integrating data
Merging dataset 2 into 1 3
Extracting anchors for merged samples
Finding integration vectors
Finding integration vector weights
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
Integrating data
Merging dataset 4 into 1 3 2
Extracting anchors for merged samples
Finding integration vectors
Finding integration vector weights
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
Integrating data
pancreas.integrated <- ScaleData(pancreas.integrated,features=VariableFeatures(pbmc))
pancreas.integrated <- RunPCA(pancreas.integrated, features = VariableFeatures(object = pbmc))
pancreas.integrated <- FindNeighbors(pancreas.integrated, dims = 1:10)
pancreas.integrated <- FindClusters(pancreas.integrated, resolution = 0.5)
pancreas.integrated <- RunUMAP(pancreas.integrated, dims = 1:10)
pancreas.integrated <- RunTSNE(pancreas.integrated, dims = 1:10)
DimPlot(pancreas.integrated, reduction = "umap",label = TRUE,group.by="orig.ident")+ggtitle("k.filter = 30")
可見苦丁,seurat在整合多樣本的時(shí)候并不會(huì)自動(dòng)為研究者提供合適的參數(shù)浸颓,我們也不應(yīng)這樣要求他們。需要注意的是default
雖然是用的最多的旺拉,并不一定是最優(yōu)的产上。
還有一種方式merge()即簡單地講多個(gè)數(shù)據(jù)集放到一起,并不運(yùn)行整合分析蛾狗。
pancreas.list <- SplitObject(pbmc, split.by = "orig.ident")
pancreas.list <- pancreas.list[c("PT039", "PT058", "PT081", "PT089")]
mpancreas.list<-pancreas.list[c("PT039", "PT058", "PT081")]
mymerg<-merge(pancreas.list$PT089,mpancreas.list)
mymerg <- NormalizeData(mymerg, normalization.method = "LogNormalize", scale.factor = 10000)
mymerg <- FindVariableFeatures(mymerg, selection.method = "vst", nfeatures = 2000)
mymerg <- ScaleData(mymerg,features=VariableFeatures(mymerg))
mymerg <- RunPCA(mymerg, features = VariableFeatures(object = mymerg))
mymerg <- FindNeighbors(mymerg, dims = 1:10)
mymerg <- FindClusters(mymerg, resolution = 0.5)
mymerg <- RunUMAP(mymerg, dims = 1:10)
mymerg <- RunTSNE(mymerg, dims = 1:10)
pm<-DimPlot(mymerg, reduction = "umap",label = TRUE,group.by="orig.ident")+ggtitle("merge")
CombinePlots(plots = list(p0,pm),legend="bottom")