適用背景
單細胞數(shù)據(jù)由于實驗平臺或樣本等原因會造成不同數(shù)據(jù)集之間存在批次效應球切,這種批次效應是人為因素造成的制妄,沒有實際的生物學意義涌萤,可能對研究結果產(chǎn)生極大影響坠狡。批次效應可由聚類結果確定继找,如果聚類出的某些亞群絕大部分都來自同一個樣本一般就認為存在批次效應,因此需要進行批次校正逃沿。
本文總結了三種常用的整合方法代碼:CCA婴渡,SCTransform和Harmony。
方法一
CCA整合方法是目前應用最多方法凯亮,是Seurat自帶的边臼,大多數(shù)情況以及夠用了,效果也還可以假消,但是對于較大數(shù)據(jù)集柠并,耗時較長,占內(nèi)存也較大置谦。目前堂鲤,Seurat官網(wǎng)在此基礎上推薦reference-based,也就是指定參考數(shù)據(jù)集進行整合媒峡,但對于自產(chǎn)數(shù)據(jù)集瘟栖,一般根本無法預先知道哪個樣本效果最好,這種reference-based的思路更適合數(shù)據(jù)挖掘類的研究谅阿。
參數(shù)簡介
- obj半哟,Seurat對象
- group.by,整合分組
- mt.pattern或mt.list签餐,指定線粒體基因寓涨,mt.pattern支持模糊搜索,mt.list直接給定基因集氯檐,格式為向量
- dim.use PCA主成分的選擇個數(shù)
- mt.cutoff 線粒體百分比閾值
- nf.low 基因數(shù)下限
- nf.high 基因數(shù)上限
- nfeatures戒良,用于整合的高變基因選擇數(shù)
- res,聚類亞群的分辨率
seurat_integ <- function(obj,group.by=NULL,
mt.pattern="^MT-",mt.list=NULL,dim.use=30,mt.cutoff=5,
nf.low=500,nf.high=6000,nfeatures=3000,
res=1.5) {
all <- obj
if (is.null(mt.list)) {
all[["percent.mt"]] <- PercentageFeatureSet(all, pattern = mt.pattern)
}else{
mt.list <- mt.list[which(mt.list %in% rownames(all))]
all[["percent.mt"]] <- PercentageFeatureSet(all, features = mt.list)
}
all <- subset(all, subset = nFeature_RNA > nf.low & percent.mt < mt.cutoff & nFeature_RNA < nf.high)
all <- NormalizeData(all, normalization.method = "LogNormalize", scale.factor = 10000)
all.list <- SplitObject(all, split.by = group.by)
for (i in 1:length(all.list)){
all.list[[i]] <- NormalizeData(all.list[[i]], verbose = FALSE)
all.list[[i]] <- FindVariableFeatures(all.list[[i]], selection.method = "vst",nfeatures = nfeatures, verbose = FALSE)}
reference.list <- all.list
all.anchors <- FindIntegrationAnchors(object.list = reference.list, dims =1:dim.use)
all.integrated <- IntegrateData(anchorset = all.anchors, dims = 1:dim.use)
DefaultAssay(all.integrated) <- "integrated"
all.integrated <- ScaleData(all.integrated, verbose = FALSE)
npcs <- dim.use+10
all.integrated <- RunPCA(all.integrated, npcs = npcs, verbose = FALSE)
all.integrated <- FindNeighbors(all.integrated, reduction = "pca", dims = 1:dim.use)
all.integrated <- FindClusters(all.integrated, resolution = res)
all.integrated <- RunUMAP(all.integrated, reduction = "pca", dims = 1:dim.use)
return(all.integrated)
}
方法二
第二種方法是Seurat官網(wǎng)極度推薦的冠摄,主要由于方法一的Normalization and variance stabilization流程存在一定問題糯崎,會造成基因表達量會與測序深度存在明顯的相關關系等几缭,因此提出了SCTransform進行預處理,然后再整合沃呢,其實后面的整合方法跟方法一的類型年栓,只不過這里的前期預處理用的是SCTransform,而方法一用的是LogNormalize薄霜,因此整合的對象結構是類似的某抓。詳細內(nèi)容可閱讀這篇文獻。這種方法理論上更為合理惰瓜,但是也更耗運行內(nèi)存和運行時間否副。參數(shù)與方法一一致。
sct_integ <- function(obj,group.by=NULL,
mt.pattern="^MT-",mt.list=NULL,dim.use=30,mt.cutoff=5,
nf.low=500,nf.high=6000,nfeatures=3000,
res=1.5) {
all <- obj
if (is.null(mt.list)) {
all[["percent.mt"]] <- PercentageFeatureSet(all, pattern = mt.pattern)
}else{
mt.list <- mt.list[which(mt.list %in% rownames(obj))]
all[["percent.mt"]] <- PercentageFeatureSet(all, features = mt.list)
}
all <- subset(all, subset = nFeature_RNA > nf.low & percent.mt < mt.cutoff & nFeature_RNA < nf.high)
all <- NormalizeData(all, normalization.method = "LogNormalize", scale.factor = 10000)
obj <- all
ifnb.list <- SplitObject(obj, split.by = group.by)
if (!requireNamespace("glmGamPoi", quietly = TRUE)) {
ifnb.list <- lapply(X = ifnb.list, FUN = SCTransform, vars.to.regress = "percent.mt", verbose = FALSE)
}else{
ifnb.list <- lapply(X = ifnb.list, FUN = SCTransform, vars.to.regress = "percent.mt", verbose = FALSE, method = "glmGamPoi")
}
features <- SelectIntegrationFeatures(object.list = ifnb.list, nfeatures = nfeatures)
ifnb.list <- PrepSCTIntegration(object.list = ifnb.list, anchor.features = features)
immune.anchors <- FindIntegrationAnchors(object.list = ifnb.list, normalization.method = "SCT",
anchor.features = features)
immune.combined.sct <- IntegrateData(anchorset = immune.anchors, normalization.method = "SCT")
immune.combined.sct <- RunPCA(immune.combined.sct, verbose = FALSE)
immune.combined.sct <- RunUMAP(immune.combined.sct, reduction = "pca", dims = 1:dim.use)
all.integrated <- immune.combined.sct
all.integrated <- FindNeighbors(all.integrated, reduction = "pca", dims = 1:dim.use)
all.integrated <- FindClusters(all.integrated, resolution = res)
all.integrated <- RunUMAP(all.integrated, reduction = "pca", dims = 1:dim.use)
return(all.integrated)
}
方法三
第三種方法是一種降維整合崎坊,基于harmony包副编,這種方法的優(yōu)勢在于夠快,大部分情況都能有比較好的結果流强。參數(shù)與方法一一致痹届。
harmony_integ <- function(obj,group.by=NULL,
mt.pattern="^MT-",mt.list=NULL,dim.use=20,mt.cutoff=5,
nf.low=500,nf.high=6000,nfeatures=3000,
res=1.5) {
library(harmony)
all <- obj
if (is.null(mt.list)) {
all[["percent.mt"]] <- PercentageFeatureSet(all, pattern = mt.pattern)
}else{
mt.list <- mt.list[which(mt.list %in% rownames(all))]
all[["percent.mt"]] <- PercentageFeatureSet(all, features = mt.list)
}
all <- subset(all, subset = nFeature_RNA > nf.low & percent.mt < mt.cutoff & nFeature_RNA < nf.high)
all <- NormalizeData(all, normalization.method = "LogNormalize", scale.factor = 10000)
all <- FindVariableFeatures(all, selection.method = "vst", nfeatures = 2000)
all.genes <- rownames(all)
all <- ScaleData(all , features = all.genes, vars.to.regress = "nCount_RNA")
saveRDS(all,"regress.rds")
all <- RunPCA(all, features = VariableFeatures(object = all))
all <- RunHarmony(all, group.by , plot_convergence = F,dims.use = 1:dim.use)
Combine <- all
Combine = RunTSNE(Combine, reduction = "harmony", dims = 1:dim.use)
Combine = RunUMAP(Combine, reduction = "harmony", dims = 1:dim.use)
Combine = FindNeighbors(Combine, reduction = "harmony",dims = 1:dim.use)
Combine = FindClusters(Combine, resolution = res)
return(Combine)
}
小結與補充
單細胞數(shù)據(jù)整合一直是個玄學,沒有說哪一種整合方法是最好的打月,不同方法針對不同樣本會出現(xiàn)不同效果队腐,只能每種方法都試一下才能知道哪種較好。而且還需要結合實際情況進行選擇奏篙,例如數(shù)據(jù)集太大柴淘,或者運行內(nèi)存不夠,可能harmony的方法更適合秘通,當然如果數(shù)據(jù)集適中为严,各種運行條件也合適那就可以考慮理論上更為合理的SCTransform方法。