批次效應(yīng) 是指在實(shí)驗(yàn)中由于不同的實(shí)驗(yàn)條件钉汗、操作人員移袍、試劑批次爷绘、設(shè)備差異等非生物學(xué)因素引起的系統(tǒng)性差異书劝。具體來說,在單細(xì)胞RNA測序(scRNA-seq)中土至,不同的實(shí)驗(yàn)批次可能會產(chǎn)生顯著的技術(shù)變異庄撮,這些變異會掩蓋實(shí)際的生物學(xué)信號,導(dǎo)致錯誤的分析結(jié)果毙籽。直觀來看洞斯,批次效應(yīng)表現(xiàn)為,在整體數(shù)據(jù)集進(jìn)行UMAP 和/或 tSNE 降維分析中坑赡,不同供體的數(shù)據(jù)不能 “很好地混合”烙如,未經(jīng)校正的批次效應(yīng)會導(dǎo)致不同批次的樣本聚集在一起,而不是根據(jù)細(xì)胞類型或狀態(tài)進(jìn)行聚類毅否。如果在此基礎(chǔ)上進(jìn)行更深的數(shù)據(jù)挖掘亚铁,就很可能會獲得虛假差異(例如來自單一供體的細(xì)胞群)。單細(xì)胞測序數(shù)據(jù)集的 整合分析(例如跨實(shí)驗(yàn)批次螟加、供體或條件) 通常是 scRNA-seq 工作流程中的關(guān)鍵步驟徘溢。通過綜合分析,可以匹配不同數(shù)據(jù)集中的共享細(xì)胞類型和狀態(tài)捆探,這不僅提高了統(tǒng)計能力然爆,更重要的是,促進(jìn)了跨數(shù)據(jù)集的準(zhǔn)確比較分析黍图。在單細(xì)胞分析的各種工作流中曾雕,Seurat
分析套件是目前應(yīng)用最廣泛的方法之一。隨著近年來 Seurat 團(tuán)隊的不斷更新助被,該分析框架已經(jīng)發(fā)展到了 v5 版本剖张。相較于之前的版本,Seurat v5
在數(shù)據(jù)結(jié)構(gòu)和分析方法上都進(jìn)行了許多重要的調(diào)整和改進(jìn)揩环。本帖收集了一些資料特別對新舊版本的數(shù)據(jù)整合工作流之間的一些差異進(jìn)行了匯總搔弄。
Seurat 標(biāo)準(zhǔn)整合工作流:Analysis, visualization, and integration of Visium HD spatial datasets with Seurat ? Seurat (satijalab.org)
1、Seurat v4 Integration Work-flow
在 Seurat v4 整合工作流程中丰滑,每個整合步驟中發(fā)生的情況非常清楚顾犹,如每個命令的文檔頁面底部所記錄。
1) SelectIntegrationFeatures()
Summary:
Choose the features to use when integrating multiple datasets. This function ranks features by the number of datasets they are deemed variable in, breaking ties by the median variable feature rank across datasets. It returns the top scoring features by this ranking.
2) FindIntegrationAnchors()
Summarized as this:
Perform dimensional reduction on the dataset pair as specified via the reduction parameter. If l2.norm is set to TRUE, perform L2 normalization of the embedding vectors.Identify anchors - pairs of cells from each dataset that are contained within each other's neighborhoods (also known as mutual nearest neighbors).
Filter low confidence anchors to ensure anchors in the low dimension space are in broad agreement with the high dimensional measurements. This is done by looking at the neighbors of each query cell in the reference dataset using max.features to define this space. If the reference cell isn't found within the first k.filter neighbors, remove the anchor.
Assign each remaining anchor a score. For each anchor cell, determine the nearest k.score anchors within its own dataset and within its pair's dataset. Based on these neighborhoods, construct an overall neighbor graph and then compute the shared neighbor overlap between anchor and query cells (analogous to an SNN graph). We use the 0.01 and 0.90 quantiles on these scores to dampen outlier effects and rescale to range between 0-1
3)IntegrateData()
Summary:
For pairwise integration:Construct a weights matrix that defines the association between each query cell and each anchor. These weights are computed as 1 - the distance between the query cell and the anchor divided by the distance of the query cell to the k.weightth anchor multiplied by the anchor score computed in FindIntegrationAnchors. We then apply a Gaussian kernel width a bandwidth defined by sd.weight and normalize across all k.weight anchors.
Compute the anchor integration matrix as the difference between the two expression matrices for every pair of anchor cells
Compute the transformation matrix as the product of the integration matrix and the weights matrix.
Subtract the transformation matrix from the original expression matrix.
It is clear then that the output of this integration workflow is a corrected matrix used for downstream analysis.
1.1 Seurat v4 Log-Normalization Based Integration
{
pacman::p_load(Seurat,SeuratData,dplyr)
options(future.globals.maxSize = 1e+12)
obt.list <- SplitObject(ifnb, split.by = "stim")
features <- SelectIntegrationFeatures(object.list = obt.list)
set.anchors <- FindIntegrationAnchors(object.list = obt.list, anchor.features = features)
set.combined <- IntegrateData(anchorset = set.anchors)
DefaultAssay(set.combined) <- 'integrated'
set.combined <- set.combined %>% RunUMAP(dims = 1:30,min.dist = 1e-5)
}
1.2 Seurat v4 SCT-Normalization Based Integration
{
pacman::p_load(Seurat,SeuratData,dplyr)
options(future.globals.maxSize = 1e+12)
ifnb <- UpdateSeuratObject(LoadData("ifnb"))
seuObj.list <- SplitObject(ifnb, split.by = "stim")
lapply(seuObj.list, function(obj){
obj %>% SCTransform(verbose = FALSE) %>%
RunPCA(verbose = FALSE)
})
features <- SelectIntegrationFeatures(object.list = seuObj.list, nfeatures = 3000)
seuObj.list <- PrepSCTIntegration(object.list = seuObj.list, anchor.features = features)
### default reduction is CCA
anchors <- FindIntegrationAnchors(object.list = seuObj.list, normalization.method = "SCT",anchor.features = features)
seuObj.combined.sct <- IntegrateData(anchorset = anchors, normalization.method = "SCT")
seuObj.combined.sct <- seuObj.combined.sct %>%
RunPCA( verbose = T) %>%
RunUMAP(dims = 1:30,min.dist = 1e-3, verbose = F)
}
2、Seurat v5 Integration Work-flow
在 Seurat v5 整合工作流程中蹦渣,相比起 v4哄芜,開發(fā)團(tuán)隊引入了
Layer
數(shù)據(jù)結(jié)構(gòu), 并且工作流程從代碼/可用性角度進(jìn)行了簡化柬唯,但 v5總體步驟與v4工作流程相同认臊。該流程的主要區(qū)別在于,在 Seurat v5 中锄奢,校正是在整個對象的全部細(xì)胞的low-dimensional space
上進(jìn)行的(默認(rèn)為PCA
空間)失晴,而不是將基因表達(dá)值本身作為校正的輸入,并且不再返回integrated assay
拘央,而是直接返回一個校正的低維空間 (默認(rèn)為integrated.dr
)涂屁。值得注意的是,在生成整個對象的PCA
空間的時候灰伟,Seurat v5 更新了FindVariableFeatures
的方法拆又,現(xiàn)流程在一個包含2+ Layers
的數(shù)據(jù)對象中查找 可變特征 時,會分別識別每一層的可變特征栏账,然后找出共同的可變特征帖族。接著,使用每個Layer
中最可變且在其他矩陣中也存在的特征挡爵,補(bǔ)充進(jìn)來直到達(dá)到所需的 n 個可變特征(這與Seurat v4中用于整合數(shù)據(jù)時識別特征的方法SelectIntegrationFeatures
相同)竖般,從而確保整合后的數(shù)據(jù)能夠準(zhǔn)確反映細(xì)胞之間的生物學(xué)差異,而不是技術(shù)差異茶鹃。此外涣雕,如果研究確實(shí)依賴
integrated assay
,Seurat v5 仍然支持舊的整合工作流程(使用IntegrateData
而不是IntegrateLayers
)闭翩。
見討論:
- Unclear what IntegrateLayers (v5) is actually doing vs IntegrateData (v4) · Issue #8653 · satijalab/seurat (github.com)
- Seurat V5 FindVariableFeatures() and HarmonyIntegration() Question · Issue #8325 · satijalab/seurat (github.com)
2.1 Seurat v5 Log-Normalization Based Integration
{
pacman::p_load(Seurat,SeuratData,dplyr)
options(future.globals.maxSize = 1e+12)
ifnb <- UpdateSeuratObject(LoadData("ifnb"))
ifnb[["RNA"]] <- split(ifnb[["RNA"]], f = ifnb$stim)
ifnb <- ifnb %>% NormalizeData(verbose = F) %>%
FindVariableFeatures(verbose = F) %>%
ScaleData(verbose = F) %>%
RunPCA(verbose = F) %>%
IntegrateLayers(method = CCAIntegration,verbose = FALSE) %>%
RunUMAP(dims = 1:30,reduction = "integrated.dr",min.dist = 1e-3, verbose = F)
# re-join layers after integration
ifnb[["RNA"]] <- JoinLayers(ifnb[["RNA"]])
}
2.2 Seurat v5 SCT-Normalization Based Integration
pacman::p_load(Seurat,SeuratData,dplyr,ggplot2)
options(future.globals.maxSize = 1e+12)
{
ifnb[["RNA"]] <- split(ifnb[["RNA"]], f = ifnb$stim)
ifnb <- ifnb %>% SCTransform(verbose = F) %>%
RunPCA(verbose = F) %>%
IntegrateLayers(method = CCAIntegration, normalization.method = "SCT", verbose = F) %>%
RunUMAP(dims = 1:30,reduction = "integrated.dr",min.dist = 1e-3, verbose = F)
}
3挣郭、Compare Result of Seurat v4 & v5 SCT-Normalization Based Integration
通過對CCA方法校正后的UMAP表示進(jìn)行目視檢查,Seurat v4和v5 版本都能比較好的校正不同技術(shù)平臺引入的技術(shù)偏差男杈,并且保留了細(xì)胞類型的可區(qū)分性丈屹。
DimPlot(seuObj.combined.sct,group.by = "seurat_annotations",label = T) & labs(title = "V4 Integ") | DimPlot(ifnb,group.by = "seurat_annotations",label = T) & labs(title = "V5 Integ")
Reference
Seurat -- SCTransform_> e14 <- sctransform(object = e14) running sctrans
https://satijalab.org/seurat/archive/v4.3/integration_introduction
Unclear what IntegrateLayers (v5) is actually doing vs IntegrateData (v4) · Issue #8653 · satijalab/seurat (github.com)