Technical Differences Between the Seurat v4 and v5 Data Integration Workflows

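A batch effect is a systematic difference caused by non-biological factors in an experiment, such as different experimental conditions, operators, reagent lots, or instruments. In single-cell RNA sequencing (scRNA-seq) specifically, different experimental batches can introduce substantial technical variation that masks the real biological signal and leads to incorrect conclusions. Intuitively, a batch effect shows up when UMAP and/or tSNE dimensional reduction of the full dataset leaves the data from different donors not "well mixed": with an uncorrected batch effect, cells group by batch rather than by cell type or state. Deeper data mining on top of such a result is likely to yield spurious differences (for example, a cell population that comes from a single donor).

Integrative analysis of single-cell sequencing datasets (for example across experimental batches, donors, or conditions) is usually a key step in an scRNA-seq workflow. Integration makes it possible to match shared cell types and states across datasets, which not only increases statistical power but, more importantly, enables accurate comparative analysis across datasets.

Among the many single-cell analysis workflows, the Seurat toolkit is currently one of the most widely used. With continued updates from the Seurat team in recent years, the framework has reached v5. Compared with earlier versions, Seurat v5 brings many important changes and improvements to both its data structures and its analysis methods. This post gathers some material and summarizes the differences between the old and new data integration workflows.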

Seurat standard integration workflow: Analysis, visualization, and integration of Visium HD spatial datasets with Seurat • Seurat (satijalab.org)

1. Seurat v4 Integration Workflow

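In the Seurat v4 integration workflow, what happens at each integration step is clearly described; the details are recorded at the bottom of each command's documentation page.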

1) SelectIntegrationFeatures()
Summary:
Choose the features to use when integrating multiple datasets. This function ranks features by the number of datasets they are deemed variable in, breaking ties by the median variable feature rank across datasets. It returns the top scoring features by this ranking.
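
The ranking rule described above can be sketched in base R. This is a toy illustration, not Seurat's implementation: the function name `rank_integration_features`, the gene names, and the choice to take the median rank only over the datasets in which a feature is variable are all assumptions made for the example.

```r
# Toy sketch of the ranking described above (not Seurat's own code).
# Each element lists one dataset's variable features, most variable first.
var.features <- list(
    ctrl = c("G1", "G2", "G3", "G4"),
    stim = c("G2", "G1", "G5", "G3"),
    rest = c("G2", "G6", "G1", "G7")
)

# Hypothetical helper: rank features by how many datasets call them variable,
# breaking ties by the median rank across those datasets.
rank_integration_features <- function(feature.lists, nfeatures) {
    all.features <- unique(unlist(feature.lists))
    # Number of datasets in which each feature is variable
    n.datasets <- sapply(all.features, function(f) {
        sum(sapply(feature.lists, function(x) f %in% x))
    })
    # Median rank over the datasets where the feature appears (assumption)
    median.rank <- sapply(all.features, function(f) {
        ranks <- unlist(lapply(feature.lists, function(x) which(x == f)))
        median(ranks)
    })
    # More datasets first; ties go to the smaller median rank
    ord <- order(-n.datasets, median.rank)
    head(all.features[ord], nfeatures)
}

selected <- rank_integration_features(var.features, nfeatures = 3)
print(selected)
```

Here G1 and G2 are both variable in all three datasets, so the tie is settled by median rank, and G3 follows because it is variable in two datasets.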
2) FindIntegrationAnchors()
Summary:
Perform dimensional reduction on the dataset pair as specified via the reduction parameter. If l2.norm is set to TRUE, perform L2 normalization of the embedding vectors.

Identify anchors - pairs of cells from each dataset that are contained within each other's neighborhoods (also known as mutual nearest neighbors).

Filter low confidence anchors to ensure anchors in the low dimension space are in broad agreement with the high dimensional measurements. This is done by looking at the neighbors of each query cell in the reference dataset using max.features to define this space. If the reference cell isn't found within the first k.filter neighbors, remove the anchor.

Assign each remaining anchor a score. For each anchor cell, determine the nearest k.score anchors within its own dataset and within its pair's dataset. Based on these neighborhoods, construct an overall neighbor graph and then compute the shared neighbor overlap between anchor and query cells (analogous to an SNN graph). We use the 0.01 and 0.90 quantiles on these scores to dampen outlier effects and rescale to range between 0-1.
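
Two of the ideas in the steps above, mutual nearest neighbors and quantile-based rescaling of scores, can be illustrated with a small base R sketch. It runs on random toy data, skips the anchor filtering and SNN-based scoring entirely, and is not how Seurat implements them; the neighborhood size `k`, the helper `rescale_scores`, and the example scores are assumptions for illustration.

```r
# Toy sketch (base R only): mutual nearest neighbors between two datasets
# in a shared low-dimensional space, then quantile-based score rescaling.
set.seed(1)
ref   <- matrix(rnorm(40), nrow = 20)          # 20 reference cells x 2 dims
query <- matrix(rnorm(30), nrow = 15) + 0.3    # 15 query cells x 2 dims, shifted

# Cross-dataset Euclidean distances: rows = query cells, columns = reference cells
cross.dist <- as.matrix(dist(rbind(query, ref)))[
    seq_len(nrow(query)), nrow(query) + seq_len(nrow(ref))]

k <- 3  # hypothetical neighborhood size
# k nearest reference cells of each query cell, and vice versa
nn.q2r <- t(apply(cross.dist, 1, function(d) order(d)[1:k]))
nn.r2q <- t(apply(cross.dist, 2, function(d) order(d)[1:k]))

# An anchor is a (query, reference) pair found in each other's neighborhoods
pairs <- expand.grid(query = seq_len(nrow(query)), ref = seq_len(nrow(ref)))
is.mnn <- mapply(function(q, r) r %in% nn.q2r[q, ] && q %in% nn.r2q[r, ],
                 pairs$query, pairs$ref)
anchors <- pairs[is.mnn, ]

# Rescale raw scores to [0, 1] using the 0.01 and 0.90 quantiles
rescale_scores <- function(x, lower = 0.01, upper = 0.90) {
    q <- quantile(x, c(lower, upper))
    if (q[2] == q[1]) return(rep(1, length(x)))  # degenerate case (assumption)
    pmin(pmax((x - q[1]) / (q[2] - q[1]), 0), 1)
}
raw.scores <- c(0.05, 0.2, 0.4, 0.6, 0.8, 0.95)  # made-up example scores
scaled <- rescale_scores(raw.scores)
```

Clamping at the 0.01 and 0.90 quantiles means a handful of extreme scores cannot stretch the scale for every other anchor.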

3) IntegrateData()
Summary:
For pairwise integration:

Construct a weights matrix that defines the association between each query cell and each anchor. These weights are computed as 1 - the distance between the query cell and the anchor divided by the distance of the query cell to the k.weight-th anchor, multiplied by the anchor score computed in FindIntegrationAnchors. We then apply a Gaussian kernel with a bandwidth defined by sd.weight and normalize across all k.weight anchors.

Compute the anchor integration matrix as the difference between the two expression matrices for every pair of anchor cells.

Compute the transformation matrix as the product of the integration matrix and the weights matrix.

Subtract the transformation matrix from the original expression matrix.

It is clear then that the output of this integration workflow is a corrected matrix used for downstream analysis.
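
The four pairwise steps above can be traced on toy data in base R. This is a sketch under stated assumptions, not Seurat's code: distances are taken directly on the toy expression values for simplicity, the exact kernel expression is an assumed form of "a Gaussian kernel with bandwidth sd.weight", and all variable names and numbers are invented for the example.

```r
# Toy sketch (base R only) of the pairwise correction steps; not Seurat's code.
set.seed(42)
n.genes <- 5; n.query <- 6; n.anchor <- 4
query.expr <- matrix(rnorm(n.genes * n.query), nrow = n.genes)  # genes x query cells

# Each anchor pairs one query cell with one reference cell (made-up pairing)
anchor.query.idx <- c(1, 2, 4, 6)
anchor.ref.expr  <- matrix(rnorm(n.genes * n.anchor), nrow = n.genes)
anchor.score     <- c(0.9, 0.5, 1.0, 0.7)
k.weight <- 3; sd.weight <- 1

# Integration matrix: expression difference for every anchor pair
integration.mat <- query.expr[, anchor.query.idx] - anchor.ref.expr  # genes x anchors

# Weights between each query cell and its k.weight nearest anchors
anchor.cells <- t(query.expr[, anchor.query.idx])       # anchors x genes
weights <- matrix(0, nrow = n.anchor, ncol = n.query)   # anchors x query cells
for (j in seq_len(n.query)) {
    d <- sqrt(rowSums(sweep(anchor.cells, 2, query.expr[, j])^2))
    nearest <- order(d)[1:k.weight]
    # 1 - (distance / distance to the k.weight-th anchor), times the anchor score
    w <- (1 - d[nearest] / d[nearest[k.weight]]) * anchor.score[nearest]
    w <- 1 - exp(-w / (2 / sd.weight)^2)   # assumed kernel form
    if (sum(w) > 0) w <- w / sum(w)        # normalize across the k.weight anchors
    weights[nearest, j] <- w
}

# Transformation matrix, then subtraction from the original expression matrix
transformation <- integration.mat %*% weights   # genes x query cells
corrected <- query.expr - transformation
```

The result `corrected` has the same genes-by-cells shape as the input, which is the point made above: the v4 workflow outputs a corrected expression matrix.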

1.1 Seurat v4 Log-Normalization Based Integration

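{
    pacman::p_load(Seurat, SeuratData, dplyr)
    options(future.globals.maxSize = 1e+12)

    # Load the ifnb dataset, as in section 1.2
    ifnb <- UpdateSeuratObject(LoadData("ifnb"))

    # Split by condition, then normalize and find variable features per dataset
    obt.list <- SplitObject(ifnb, split.by = "stim")
    obt.list <- lapply(obt.list, function(obj) {
        obj %>% NormalizeData(verbose = FALSE) %>%
            FindVariableFeatures(verbose = FALSE)
    })
    features <- SelectIntegrationFeatures(object.list = obt.list)
    set.anchors <- FindIntegrationAnchors(object.list = obt.list, anchor.features = features)
    set.combined <- IntegrateData(anchorset = set.anchors)

    # Downstream analysis runs on the corrected 'integrated' assay
    DefaultAssay(set.combined) <- "integrated"
    set.combined <- set.combined %>%
        ScaleData(verbose = FALSE) %>%
        RunPCA(verbose = FALSE) %>%   # RunUMAP needs a PCA reduction to exist
        RunUMAP(dims = 1:30, min.dist = 1e-5)
}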

1.2 Seurat v4 SCT-Normalization Based Integration

{
    pacman::p_load(Seurat, SeuratData, dplyr)
    options(future.globals.maxSize = 1e+12)

    ifnb <- UpdateSeuratObject(LoadData("ifnb"))

    seuObj.list <- SplitObject(ifnb, split.by = "stim")
    # Assign the result back; otherwise the SCT-normalized objects are discarded
    seuObj.list <- lapply(seuObj.list, function(obj) {
        obj %>% SCTransform(verbose = FALSE) %>%
            RunPCA(verbose = FALSE)
    })

    features <- SelectIntegrationFeatures(object.list = seuObj.list, nfeatures = 3000)
    seuObj.list <- PrepSCTIntegration(object.list = seuObj.list, anchor.features = features)
    # The default reduction is CCA
    anchors <- FindIntegrationAnchors(object.list = seuObj.list, normalization.method = "SCT",
                                      anchor.features = features)
    seuObj.combined.sct <- IntegrateData(anchorset = anchors, normalization.method = "SCT")
    seuObj.combined.sct <- seuObj.combined.sct %>%
        RunPCA(verbose = TRUE) %>%
        RunUMAP(dims = 1:30, min.dist = 1e-3, verbose = FALSE)
}

2. Seurat v5 Integration Workflow

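Compared with v4, the Seurat v5 integration workflow introduces the Layer data structure and is simplified from a code/usability perspective, but the overall steps in v5 are the same as in the v4 workflow. The main difference is that in Seurat v5 the correction is performed on a low-dimensional space computed over all cells of the whole object (the PCA space by default), rather than taking the gene expression values themselves as the input to be corrected. It also no longer returns an integrated assay; instead it directly returns a corrected low-dimensional space (named integrated.dr by default).

Notably, for building the PCA space of the whole object, Seurat v5 updated FindVariableFeatures. When looking for variable features in an object that contains two or more Layers, it now identifies the variable features of each layer separately and then finds the ones they share. It then tops the set up with the most variable features of each Layer that are also present in the other matrices, until the requested n variable features is reached (the same approach as SelectIntegrationFeatures, which Seurat v4 uses to pick features for integration). The aim is for the integrated data to reflect biological differences between cells rather than technical ones.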
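
To make the contrast with v4 concrete, the following base R sketch shows what "correcting in a low-dimensional space" means in terms of inputs and outputs. It is deliberately simplistic: the per-batch mean centering is a stand-in chosen for brevity and is NOT what CCAIntegration does, and the toy data and variable names are assumptions. The only point is that the result is a corrected cells-by-dimensions embedding while the expression matrix is left untouched.

```r
# Toy sketch (base R only): correction applied to an embedding, not to expression.
set.seed(7)
n.genes <- 50; n.cells <- 40
batch <- rep(c("ctrl", "stim"), each = n.cells / 2)
expr <- matrix(rnorm(n.genes * n.cells), nrow = n.genes)   # genes x cells
expr[, batch == "stim"] <- expr[, batch == "stim"] + 1     # artificial batch shift
expr.before <- expr

# v5-style input: one PCA computed over all cells of the object
pca <- prcomp(t(expr), center = TRUE, scale. = FALSE)
embedding <- pca$x[, 1:10]                                 # cells x 10 dims

# Stand-in correction (NOT CCAIntegration): remove each batch's mean offset
integrated.dr <- embedding
for (b in unique(batch)) {
    idx <- batch == b
    integrated.dr[idx, ] <- sweep(embedding[idx, ], 2, colMeans(embedding[idx, ]))
}

dim(integrated.dr)   # cells x dims: this is what downstream steps would consume
```

Downstream steps such as UMAP or neighbor finding would then read the corrected embedding, which mirrors the `reduction = "integrated.dr"` argument used in the v5 code in this section.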

In addition, if a study really does depend on the integrated assay, Seurat v5 still supports the old integration workflow (using IntegrateData instead of IntegrateLayers).

See the discussions:

Unclear what IntegrateLayers (v5) is actually doing vs IntegrateData (v4) · Issue #8653 · satijalab/seurat (github.com)
Seurat V5 FindVariableFeatures() and HarmonyIntegration() Question · Issue #8325 · satijalab/seurat (github.com)

2.1 Seurat v5 Log-Normalization Based Integration

{
    pacman::p_load(Seurat,SeuratData,dplyr)
    options(future.globals.maxSize = 1e+12)

    ifnb <- UpdateSeuratObject(LoadData("ifnb"))
    ifnb[["RNA"]] <- split(ifnb[["RNA"]], f = ifnb$stim)
    ifnb <- ifnb %>% NormalizeData(verbose = F) %>% 
        FindVariableFeatures(verbose = F) %>% 
        ScaleData(verbose = F) %>% 
        RunPCA(verbose = F) %>% 
        IntegrateLayers(method = CCAIntegration,verbose = FALSE) %>% 
        RunUMAP(dims = 1:30,reduction = "integrated.dr",min.dist = 1e-3, verbose = F)
    
    # re-join layers after integration
    ifnb[["RNA"]] <- JoinLayers(ifnb[["RNA"]])
}

2.2 Seurat v5 SCT-Normalization Based Integration

pacman::p_load(Seurat, SeuratData, dplyr, ggplot2)
options(future.globals.maxSize = 1e+12)

{
    # Reload ifnb so this block does not depend on the object modified in 2.1
    ifnb <- UpdateSeuratObject(LoadData("ifnb"))
    ifnb[["RNA"]] <- split(ifnb[["RNA"]], f = ifnb$stim)
    ifnb <- ifnb %>% SCTransform(verbose = FALSE) %>%
        RunPCA(verbose = FALSE) %>%
        IntegrateLayers(method = CCAIntegration, normalization.method = "SCT", verbose = FALSE) %>%
        RunUMAP(dims = 1:30, reduction = "integrated.dr", min.dist = 1e-3, verbose = FALSE)
}

3. Comparing the Results of Seurat v4 & v5 SCT-Normalization Based Integration

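Visual inspection of the UMAP representations after CCA-based correction suggests that both Seurat v4 and v5 correct the technical differences between batches (here, the two `stim` groups of ifnb) reasonably well while keeping the cell types distinguishable.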

(DimPlot(seuObj.combined.sct, group.by = "seurat_annotations", label = TRUE) + labs(title = "V4 Integ")) |
    (DimPlot(ifnb, group.by = "seurat_annotations", label = TRUE) + labs(title = "V5 Integ"))

Reference

Seurat -- SCTransform_> e14 <- sctransform(object = e14) running sctrans
https://satijalab.org/seurat/archive/v4.3/integration_introduction
Unclear what IntegrateLayers (v5) is actually doing vs IntegrateData (v4) · Issue #8653 · satijalab/seurat (github.com)

Last edited: 2024.07.24 17:01:51
