10X空間轉(zhuǎn)錄組數(shù)據(jù)分析之空間注釋(解卷積沃暗,STdeconvolve)

hello月洛,大家好,今天我們來分享一個新的空間注釋的方法孽锥,當(dāng)然主要思想還是解卷積嚼黔,關(guān)于空間轉(zhuǎn)錄組的注釋方法细层,我已經(jīng)分享了很多了,在這里列舉出來唬涧,供大家參考疫赎。

MIA用于單細(xì)胞和空間的聯(lián)合分析

10X單細(xì)胞和空間聯(lián)合分析的方法---cell2location

10X空間轉(zhuǎn)錄組和10X單細(xì)胞數(shù)據(jù)聯(lián)合分析方法匯總

10X單細(xì)胞空間聯(lián)合分析之四----DSTG

10X單細(xì)胞空間聯(lián)合分析之三----Spotlight

10X單細(xì)胞空間聯(lián)合分析之五----spatialDWLS

10X單細(xì)胞空間聯(lián)合分析之六(依據(jù)每個spot的細(xì)胞數(shù)量進(jìn)行單細(xì)胞空間聯(lián)合分析)----Tangram

10X單細(xì)胞-10X空間轉(zhuǎn)錄組聯(lián)合分析之七----CellDART

當(dāng)然還有依據(jù)marker注釋空間轉(zhuǎn)錄組的方法,10X空間轉(zhuǎn)錄組數(shù)據(jù)分析之思路總結(jié)(針對腫瘤樣本)

10X單細(xì)胞-10X空間轉(zhuǎn)錄組聯(lián)合分析之八----STRIDE(三維重構(gòu))

當(dāng)然了碎节,方法很多捧搞,不見得都是適合自己的樣本分析,每種方法都有其優(yōu)勢和不足钓株,我們要根據(jù)自己的實際情況進(jìn)行選擇实牡。好了陌僵,開始我們今天的分享轴合,文章在Reference-free cell-type deconvolution of pixel-resolution spatially resolved transcriptomics data,用到的方法是STdeconvolve碗短,下面放幾張分析效果圖受葛。

圖片.png

圖片.png

圖片.png

當(dāng)然了,空間轉(zhuǎn)錄組可分析的內(nèi)容是在太多了偎谁,如果大家有興趣总滩,可以在文章的下面評論一下,如果人員比較多的話巡雨,我就給大家講一節(jié)公開課闰渔,大家一起交流交流,好了铐望,開始我們今天的內(nèi)容分享冈涧。

Abstract

最近的技術(shù)進(jìn)步使空間解析轉(zhuǎn)錄組分析成為可能,但在多細(xì)胞像素分辨率下正蛙,從而阻礙了細(xì)胞類型空間共定位模式的識別统阿。 作者開發(fā)了 STdeconvolve 作為一種無監(jiān)督方法來對包含這種多細(xì)胞像素分辨率空間分辨轉(zhuǎn)錄組學(xué)數(shù)據(jù)集的底層細(xì)胞類型進(jìn)行去卷積开皿。 實際運(yùn)用表明 STdeconvolve 有效地恢復(fù)了細(xì)胞類型的假定轉(zhuǎn)錄組學(xué)特征及其在空間分辨像素內(nèi)的比例表示,而無需依賴外部單細(xì)胞轉(zhuǎn)錄組學(xué)參考(如果真的不以來單細(xì)胞數(shù)據(jù),那真的是很完美们妥,我們往下看看)。

Main

描繪組織內(nèi)轉(zhuǎn)錄不同細(xì)胞類型的空間組織對于理解組織功能的細(xì)胞基礎(chǔ)至關(guān)重要眯杏。最近的技術(shù)已經(jīng)能夠以多細(xì)胞像素分辨率在組織內(nèi)進(jìn)行空間分辨轉(zhuǎn)錄組 (ST) 分析匾乓。因此,這些 ST 測量代表可能包含多種細(xì)胞類型的細(xì)胞混合物鳄厌。這種缺乏單細(xì)胞分辨率阻礙了細(xì)胞類型特定空間組織的表征荞胡。為了應(yīng)對這一挑戰(zhàn),最近開發(fā)了 SPOTlight 和 RCTD 等監(jiān)督解卷積方法來預(yù)測 ST 像素內(nèi)細(xì)胞類型的比例部翘。然而硝训,這些有監(jiān)督的反卷積方法依賴于合適的單細(xì)胞參考的可用性,如果由于預(yù)算、技術(shù)或生物學(xué)限制而沒有這樣的參考窖梁,則可能會出現(xiàn)限制赘风。在這里,作者開發(fā)了 STdeconvolve作為一種用于解卷積多細(xì)胞像素分辨率 ST 數(shù)據(jù)的無監(jiān)督纵刘、無參考的分析方法(下圖)邀窃。
圖片.png

圖片.png
給定 ST 數(shù)據(jù)的計數(shù)矩陣,STdeconvolve 推斷不同細(xì)胞類型的推定轉(zhuǎn)錄組特征及其在每個多細(xì)胞空間分辨 ST 像素內(nèi)的比例表示假哎。 簡而言之瞬捕,STdeconvolve 的第一個特征選擇可能提供轉(zhuǎn)錄不同細(xì)胞類型信息的基因。 STdeconvolve 然后基于Latent Dirichlet Allocation (LDA) 來估計轉(zhuǎn)錄不同細(xì)胞類型 K 的數(shù)量舵抹,并在 ST 像素中對這些 K 細(xì)胞類型進(jìn)行解卷積肪虎。 STdeconvolve 利用 ST 數(shù)據(jù)的幾個獨特功能,使 LDA 的此應(yīng)用程序特別適合惧蛹。
首先評估了 STdeconvolve 在使用模擬 ST 數(shù)據(jù)恢復(fù)細(xì)胞類型及其轉(zhuǎn)錄組學(xué)特征的比例表示方面的性能扇救。 這是通過將小鼠內(nèi)側(cè)視前區(qū) (MPOA) 的單細(xì)胞分辨率多重錯誤魯棒熒光原位雜交 (MERFISH) 數(shù)據(jù)聚合到 100 μm2 像素中來完成的。
圖片.png
  • 注:Ground truth single-cell resolution MERFISH data of one section of the MPOA partitioned into 100 μm2 pixels (black dashed squares). Each dot is a single cell colored by its ground truth cell-type label.
應(yīng)用 STdeconvolve香嗓,確定了 K=9 轉(zhuǎn)錄不同的細(xì)胞類型迅腔,并在每個模擬像素中對它們的轉(zhuǎn)錄組譜和比例表示進(jìn)行了反卷積(下圖)。
圖片.png
  • 注:B. Proportions of deconvolved cell-types from STdeconvolve, represented as pie charts for each simulated pixel靠娱。
為了推斷去卷積細(xì)胞類型的身份以進(jìn)行基準(zhǔn)測試沧烈,將它們的去卷積轉(zhuǎn)錄譜與真實細(xì)胞類型的轉(zhuǎn)錄譜進(jìn)行了匹配(下圖)。
圖片.png
  • 注:(B)Pearson’s correlation between the transcriptional profiles of the 9 ground truth cell-types in the MERFISH MPOA data and the 9 deconvolved cell-types by STdeconvolve.(C). Distributions of Pearson’s correlations between the transcriptional profiles or pixel proportions of matched deconvolved and ground truth cell-types. Matched cell-types are indicated as highlighted boxes in B.
觀察到每個去卷積細(xì)胞類型的轉(zhuǎn)錄組譜與跨基因匹配的真實細(xì)胞類型之間的強(qiáng)相關(guān)性像云,同樣地锌雀,在模擬像素中每個去卷積細(xì)胞類型和匹配的真實細(xì)胞類型的比例之間存在強(qiáng)相關(guān)性。
圖片.png
  • 注:(C)The ranking of each gene based on its expression level in the transcriptional profiles of the deconvolved cell-types, compared to its gene rank in the transcriptional profile of the matched ground truth cell-type. (D). Heatmap of Pearson’s correlations between the proportions of the deconvolved cell-types and ground truth cell-types across simulated pixels. Ground truth cell-types are ordered by their frequencies in the ground truth dataset. Matched deconvolved and ground truth cell-types are boxed.
一些去卷積的細(xì)胞類型苫费,如細(xì)胞類型 X2 和 X8 都與興奮性神經(jīng)元匹配汤锨,而細(xì)胞類型 X4 和 X7 都與抑制性神經(jīng)元匹配。 根據(jù)先前的注釋百框,將真實的興奮性和抑制性細(xì)胞類型進(jìn)一步劃分為其他亞型(共 76 種)闲礼,發(fā)現(xiàn)這些去卷積的細(xì)胞類型與神經(jīng)元亞型的特定組合相關(guān)
圖片.png
當(dāng)進(jìn)一步將去卷積細(xì)胞類型的數(shù)量擴(kuò)大到 K=76 時,能夠識別在轉(zhuǎn)錄譜和像素比例與更精細(xì)的神經(jīng)元亞型以及稀有細(xì)胞類型方面高度相關(guān)的去卷積細(xì)胞類型such as pericytes and microglia铐维。此外柬泽,由于當(dāng)前的 ST 技術(shù)允許以不同的分辨率進(jìn)行空間轉(zhuǎn)錄組分析,因此進(jìn)一步模擬了另一個 20 μm2 分辨率的 ST 數(shù)據(jù)集嫁蛇,并觀察到解卷積的細(xì)胞類型轉(zhuǎn)錄組譜與 STdeconvolve 的基本事實的比例之間類似的強(qiáng)相關(guān)性(下圖).
圖片.png
  • 注:Comparison of STdeconvolve cell-types to ground truth cell-types of a 20 μm2 197 pixel MERFISH MPOA ST dataset锨并。
使用模擬的 MPOA 的 100 μm2 分辨率 ST 數(shù)據(jù),還將 STdeconvolve 與現(xiàn)有的監(jiān)督去卷積方法 SPOTlight 和 RCTD 進(jìn)行了比較睬棚。 對于理想的單細(xì)胞轉(zhuǎn)錄組學(xué)參考第煮,使用了原始單細(xì)胞 MERFISH 數(shù)據(jù)解幼。 使用去卷積的細(xì)胞類型比例的均方根誤差 (RMSE) 與模擬像素上的真實情況(方法)來評估每種方法的性能。 總的來說包警,發(fā)現(xiàn) STdeconvolve 的性能與 SPOTlight和 RCTD 相當(dāng).此類現(xiàn)有監(jiān)督解卷積方法的一個潛在限制是它們依賴于合適的單細(xì)胞參考撵摆。 因此,試圖通過從 MERFISH 單細(xì)胞參考中去除神經(jīng)元細(xì)胞類型來評估它們在沒有合適的單細(xì)胞參考時的性能害晦。 重新評估性能導(dǎo)致 SPOTlight和 RCTD 的 RMSE 增加.
接下來特铝,通過分析小鼠主嗅球(MOB)的 100 μm2 分辨率 ST 數(shù)據(jù)來評估 STdeconvolve 的性能, 由于topographically organized的感官輸入,MOB 由多個雙邊對稱和轉(zhuǎn)錄不同的細(xì)胞層組成壹瘟。 雖然之前對 MOB ST 數(shù)據(jù)的聚類分析揭示了粗細(xì)胞層的粗略空間組織鲫剿,但無法輕易觀察到更精細(xì)的結(jié)構(gòu),例如喙遷移流 (RMS) .We applied STdeconvolve to identify K=12 transcriptionally distinct cell-types that either overlapped with or further split coarse cell layers previously identified from clustering analysis稻轨。
圖片.png
  • 注:Deconvolved cell-type proportions for ST data of the MOB from STdeconvolve, represented as pie charts for each ST pixel. Pixels are outlined with colors based on the pixel transcriptional cluster assignment corresponding to MOB coarse cell layers.
特別是灵莲,雖然去卷積的細(xì)胞類型 X7 與先前通過聚類分析確定的顆粒細(xì)胞層重疊,但它在空間上位于預(yù)期 RMS 的位置澄者。
圖片.png
已知在其去卷積轉(zhuǎn)錄譜中上調(diào)的基因笆呆,包括 Nrep、Sox11 和 Dcx粱挡,與神經(jīng)元分化相關(guān)或在 RMS 內(nèi)的神經(jīng)元前體細(xì)胞中上調(diào),并根據(jù) ISH 染色標(biāo)記預(yù)期的空間組織俄精。
圖片.png
這表明去卷積的細(xì)胞類型 X7 對應(yīng)于未從聚類分析中識別的 RMS 內(nèi)的神經(jīng)元前體細(xì)胞類型询筏。 同樣,使用適當(dāng)?shù)?MOB scRNA-seq 參考將 STdeconvolve 與 SPOTlight和 RCTD 進(jìn)行了比較竖慧,并發(fā)現(xiàn)所有評估方法之間具有高度的對應(yīng)性嫌套。
通過從 MOB scRNA-seq 參考中去除嗅鞘細(xì)胞 (OEC),再次比較了在缺乏合適的單細(xì)胞參考時這種監(jiān)督解卷積方法的性能圾旨。 通過所有評估方法踱讨,最初預(yù)測 OECs 在嗅覺神經(jīng)層中富集
圖片.png
然而,鑒于這個沒有 OEC 的新參考砍的, SPOTlight和 RCTD 錯誤地預(yù)測 N2 細(xì)胞在嗅覺神經(jīng)層中富集并且高度豐富痹筛,盡管最初所有方法都預(yù)測 N2 細(xì)胞是罕見的。 此外廓鞠,在來自小鼠皮層的 scRNA-seq 參考上訓(xùn)練了 SPOTlight和 RCTD帚稠,導(dǎo)致皮層參考的血管軟腦膜 (VLMC) 細(xì)胞簇被錯誤地預(yù)測為在嗅覺神經(jīng)層中高度富集 . 因此,監(jiān)督解卷積方法可能對使用的單細(xì)胞轉(zhuǎn)錄組學(xué)參考敏感床佳。
圖片.png
最后滋早,為了證明無監(jiān)督、無參考方法的潛力砌们,將 STdeconvolve 應(yīng)用于 4 個乳腺癌切片的 100 μm2 分辨率 ST 數(shù)據(jù)杆麸。 在這里搁进,匹配的 scRNA-seq 參考不可用,并且由于潛在的腫瘤間異質(zhì)性昔头,使用來自另一個乳腺癌樣本的 scRNA-seq 參考可能是不合適的拷获。 ST 像素的轉(zhuǎn)錄聚類先前確定了 3 個轉(zhuǎn)錄不同的cluster,對應(yīng)于 3 組織的組織學(xué)區(qū)域:原位導(dǎo)管癌减细、浸潤性癌和非惡性.
圖片.png
然而匆瓜,腫瘤微環(huán)境是許多其他細(xì)胞類型的復(fù)雜環(huán)境。 應(yīng)用 STdeconvolve 來識別額外的細(xì)胞類型并詢問它們的空間組織未蝌,導(dǎo)致 K=15 識別的細(xì)胞類型.
圖片.png
  • 注:(A)Deconvolved cell-type pixel proportions for ST data of a breast cancer tissue section, represented as pie charts. Pixels are outlined with colors based on the pixel transcriptional cluster assignment corresponding to 3 pathological annotations. (B). Highlight of deconvolved cell-type X15. Pixel proportion of deconvolved cell-type X15 are indicated as black slices in pie charts. Pixels are outlined with colors as in A). An H&E-stained image of the breast cancer tissue is shown in the background.
其中驮吱,去卷積的細(xì)胞類型 X3、X13 和 X15 在空間上對應(yīng)萧吠,并且去卷積的轉(zhuǎn)錄譜分別與非惡性左冬、導(dǎo)管原位癌和浸潤性癌注釋一致。然而纸型,去卷積細(xì)胞型X15在組織的癌性和非惡性區(qū)域的界面處空間富集拇砰。去卷積細(xì)胞型X15的高表達(dá)基因包括免疫基因如CD74和CXCL10,基因集富集分析表明顯著富集 在免疫過程中表明去卷積的細(xì)胞類型 15 可能對應(yīng)于免疫細(xì)胞狰腌。 這種沿腫瘤邊界的免疫細(xì)胞空間組織先前已被認(rèn)為在乳腺癌預(yù)后中發(fā)揮作用除破。 因此 STdeconvolve 可能有助于解卷積異質(zhì)組織中轉(zhuǎn)錄不同的細(xì)胞類型,以發(fā)現(xiàn)潛在的臨床相關(guān)空間組織琼腔。
總之瑰枫,作者開發(fā)了 STdeconvolve 作為分析 ST 數(shù)據(jù)的工具,以在不依賴單細(xì)胞轉(zhuǎn)錄組學(xué)參考的情況下恢復(fù)細(xì)胞類型比例和轉(zhuǎn)錄譜丹莲。 分析表明光坝,當(dāng)合適的單細(xì)胞參考可用時,STdeconvolve 可以概括預(yù)期的生物學(xué)并提供與現(xiàn)有監(jiān)督方法相比具有競爭力的性能甥材,以及當(dāng)合適的單細(xì)胞參考不可用時盯另,具有潛在的優(yōu)越性能。 一般來說洲赵,預(yù)計 STdeconvolve 將有助于研究復(fù)雜異質(zhì)組織中轉(zhuǎn)錄不同細(xì)胞類型之間的空間關(guān)系鸳惯。

Method(令人頭疼的算法又要來了)

圖片.png

LDA Modeling(這個模型,是我們應(yīng)該關(guān)注的重點)

圖片.png

圖片.png

Selection of genes for LDA model

圖片.png

圖片.png

Selection of LDA model with optimal number of cell-types

圖片.png

圖片.png

方法間的比較

圖片.png

圖片.png

圖片.png

最后板鬓,我們來看看示例代碼

加載
require(remotes)
remotes::install_github('JEFworks-Lab/STdeconvolve')
library(STdeconvolve)
data(mOB)
pos <- mOB$pos
cd <- mOB$counts
annot <- mOB$annot
STdeconvolve 第一個特征通過尋找跨 ST 像素高度過度分散的基因悲敷,選擇最有可能與區(qū)分細(xì)胞類型相關(guān)的基因。 基因太少或讀取太少的基因的像素也可以被去除俭令。
## remove pixels with too few genes
counts <- cleanCounts(counts = cd,
                      min.lib.size = 100,
                      min.reads = 1,
                      min.detected = 1)
圖片.png
## feature select for genes
corpus <- restrictCorpus(counts,
                         removeAbove=1.0,
                         removeBelow = 0.05,
                         alpha = 0.05,
                         plot = TRUE,
                         verbose = TRUE)
圖片.png
STdeconvolve 然后應(yīng)用 Latent Dirichlet allocation (LDA)(一種常用于自然語言處理的生成統(tǒng)計模型)來發(fā)現(xiàn) K 種潛在細(xì)胞類型后德。 STdeconvolve 適合一系列 LDA 模型,以告知最佳K 的選擇抄腔。
## Note: the input corpus needs to be an integer count matrix of pixels x genes
ldas <- fitLDA(t(as.matrix(corpus)), Ks = seq(2, 9, by = 1), plot=TRUE, verbose=TRUE)
圖片.png
In this example, we will use the model with the lowest model perplexity.
The shaded region indicates where a fitted model for a given K had an alpha > 1. alpha is an LDA parameter that is solved for during model fitting and corresponds to the shape parameter of a symmetric Dirichlet distribution. In the model, this Dirichlet distribution describes the cell-type proportions in the pixels. A symmetric Dirichlet with alpha > 1 would lead to more uniform cell-type distributions in the pixels and difficulty identifying distinct cell-types. Instead, we want models with alphas < 1, resulting in sparse distributions where only a few cell-types are represented in a given pixel.
The resulting theta matrix can be interpreted as the proportion of each deconvolved cell-type across each spatially resolved pixel. The resulting beta matrix can be interpreted as the putative gene expression profile for each deconvolved cell-type normalized to a library size of 1. This beta matrix can be scaled by a depth factor (ex. 1000) for interpretability.
## select model with minimum perplexity
optLDA <- optimalModel(models = ldas, opt = "min")
## extract pixel cell-type proportions (theta) and cell-type gene expression profiles (beta) for the given dataset
results <- getBetaTheta(optLDA, corpus = t(as.matrix(corpus)))
deconProp <- results$theta
deconGexp <- results$beta*1000
vizAllTopics(deconProp, pos, 
             groups = annot, 
             group_cols = rainbow(length(levels(annot))),
             r=0.4)
圖片.png
還可以可視化每種去卷積細(xì)胞類型的頂級標(biāo)記基因瓢湃。 我們將在這里使用解卷積的細(xì)胞類型 5 和 1 作為示例理张。 我們將這里的頂級標(biāo)記基因定義為在去卷積細(xì)胞類型(計數(shù) > 5)中高度表達(dá)的基因,當(dāng)將去卷積細(xì)胞類型的表達(dá)譜與所有其他細(xì)胞類型的平均值進(jìn)行比較時绵患,這些基因也具有前 4 個最高 log2(倍數(shù)變化) 去卷積細(xì)胞類型的表達(dá)譜雾叭。
celltype <- 5
## highly expressed in cell-type of interest
highgexp <- names(which(deconGexp[celltype,] > 5))
## high log2(fold-change) compared to other deconvolved cell-types
log2fc <- sort(log2(deconGexp[celltype,highgexp]/colMeans(deconGexp[-celltype,highgexp])), decreasing=TRUE)
markers <- names(log2fc)[1:4]
## visualize
df <- merge(as.data.frame(pos), 
            as.data.frame(t(as.matrix(counts[markers,]))), 
            by = 0)
ps <- lapply(markers, function(marker) {
  vizGeneCounts(df = df,
              gene = marker,
              size = 2, stroke = 0.1,
              plotTitle = marker,
              winsorize = 0.05,
              showLegend = FALSE)
})
gridExtra::grid.arrange(
  grobs = ps,
  layout_matrix = rbind(c(1, 2),
                        c(3, 4))
)
圖片.png
celltype <- 1
## highly expressed in cell-type of interest
highgexp <- names(which(deconGexp[celltype,] > 5))
## high log2(fold-change) compared to other deconvolved cell-types
log2fc <- sort(log2(deconGexp[celltype,highgexp]/colMeans(deconGexp[-celltype,highgexp])), decreasing=TRUE)
markers <- names(log2fc)[1:4]
## visualize
df <- merge(as.data.frame(pos), 
            as.data.frame(t(as.matrix(counts[markers,]))), 
            by = 0)
ps <- lapply(markers, function(marker) {
  vizGeneCounts(df = df,
              gene = marker,
              size = 2, stroke = 0.1,
              plotTitle = marker,
              winsorize = 0.05,
              showLegend = FALSE)
})
gridExtra::grid.arrange(
  grobs = ps,
  layout_matrix = rbind(c(1, 2),
                        c(3, 4))
)
圖片.png

當(dāng)然了,軟件還有其他的一些功能

In this tutorial, we will describe additional functionalities and advanced features of STdeconvolve.

Preprocessing

For LDA model fitting, STdeconvolve requires a “corpus of documents”, which is represented as a pixel (rows) x feature (columns) matrix of non-negative integer gene counts. To effectively deconvolve latent cell-types, the features in the corpus should be limited to genes that are variable across cell-types. Without prior knowledge of cell-types and their marker genes, one could look for over dispersed genes across the pixels, assuming that the underlying cell-types are present at variable proportions.
Additionally, it is useful to reduce the number of features in the corpus to those which are the most informative, to not only improve deconvolution but also increase speed of model fitting. To this end, one may wish to remove genes that are present in most or all pixels, or occur in only a small fraction of the pixels. This is because LDA identifies latent topics, or cell-types, by looking for sets of genes that occur frequently together in pixels. Therefore, removing genes that are present in most or all pixels will help restrict to gene that are cell-type specific, assuming that cell-types are not present across all pixels. Conversely, genes that are present in only a few pixels could represent noise and may not actually represent robust cell-type specific signatures.
As previously mentioned, a set of marker genes may be known a priori and a user may want to include these in the final corpus.
We include 2 different functions with STdeconvolve.
The first is restrictCorpus(), which is highlighted in the getting_started tutorial(就是上面的內(nèi)容). This function first feature selects for over dispersed genes in the corpus and then allows a user to restrict the over dispersed genes to those present in less than or more than specified fractions of pixels in the dataset. This function does not filter for poor pixels or genes itself.
The second is preprocess(), which is a larger wrapper function and allows for a much greater and specified range of filtered and feature
library(STdeconvolve)
data(mOB)
pos <- mOB$pos
cd <- mOB$counts
annot <- mOB$annot
In general, preprocess() includes a step to remove poor pixels and genes, allows one to select for specific genes to include or remove, allows an option to select for over dispersed genes, and options to remove top expressed genes, or genes present in less than or more than specified fractions of pixels in the dataset. Further, it returns an organized list that contains the filtered corpus, and the positions of the pixels retained in the filtered corpus if the information is present in the pixel names originally (for example, if the name of a pixel is in the format “XxY”). Otherwise this can be appended after.
Lastly, preprocess() can take as input a pixel (row) x gene (columns) matrix or a path to the file.
Order of filtering options in which they occur: 1. Selection to use specific genes only 2. cleanCounts to remove poor pixels and genes 3. Remove top expressed genes in matrix 4. Remove specific genes based on grepl pattern matching 5. Remove genes that appear in more/less than a percentage of pixels 6. Use the over dispersed genes computed from the remaining genes after filtering steps 1-5 (if selected) 7. Choice to use the top over dispersed genes based on -log10(p.adj)
mobCorpus1 <- preprocess(t(cd),
                       alignFile = NA, # if there is a file to adjust pixel coordinates this can be included.
                       extractPos = FALSE, # optional argument
                       selected.genes = NA, # 
                       nTopGenes = 3, # remove the top 3 expressed genes (genes with most counts) in dataset
                       genes.to.remove = c("^Trmt"), # ex: remove tRNA methyltransferase genes (gene names that begin with "Trmt")
                       removeAbove = 0.95, # remove genes present in 95% or more of pixels
                       removeBelow = 0.05, # remove genes present in 5% or less of pixels
                       min.reads = 10, # minimum number of reads a gene must have across pixels
                       min.lib.size = 100, # minimum number of reads a pixel must have to keep (before gene filtering)
                       min.detected = 1, # minimum number of pixels a gene needs to have been detected in
                       ODgenes = TRUE, # feature select for over dispersed genes
                       nTopOD = 100, # number of top over dispersed genes to use, otherwise use all that pass filters if `NA`
                       od.genes.alpha = 0.05, # alpha param for over dispersed genes
                       gam.k = 5, # gam param for over dispersed genes
                       verbose = TRUE)
圖片.png

圖片.png
mobCorpus1$pos <- pos[rownames(mobCorpus1$corpus), ] # because positions were not available in the counts matrix itself, append after.
mobCorpus1$slm
## A 260x100 simple triplet matrix.
print(mobCorpus1$corpus[1:10,1:10])
##                    Bpifb9a Bpifb9b Col1a1 Dcn Cyp2a5 Sox11 Omp Ogn Prokr2 Ptn
## ACAACTATGGGTTGGCGG       0       0      1   0      0     1   0   0      0   2
## ACACAGATCCTGTTCTGA       1       1      0   0      0     0   6   0      0  22
## ACATCACCTGCGCGCTCT       0       0      1   2      0     6   0   0      0   0
## ACATTTAAGGCGCATGAT       0       0      0   1      0     0   1   0      1   1
## ACCACTGTAATCTCCCAT       0       0      0   1      1     0   1   0      0   1
## ACCAGAGCCGTTGAGCAA       0       0      0   0      0     1   3   0      0   2
## ACCCGGCGTAACTAGATA       0       1      0   1      0     3   0   0      1   2
## ACCGGAGTAAATTAGCGG       0       0      0   0      0     2   2   0      0   2
## ACCTGACAGCGGAAACTT       0       0      1   1      0     0   8   0      0   7
## ACGGAAATCAGTGGTATT       0       1      0   1      0     4   3   0      1   0
print(mobCorpus1$pos[1:10,])
##                         x      y
## ACAACTATGGGTTGGCGG 16.001 16.036
## ACACAGATCCTGTTCTGA 26.073 15.042
## ACATCACCTGCGCGCTCT 13.048 21.079
## ACATTTAAGGCGCATGAT 13.963 18.117
## ACCACTGTAATCTCCCAT 24.104 13.074
## ACCAGAGCCGTTGAGCAA  9.040 12.046
## ACCCGGCGTAACTAGATA 15.941 12.112
## ACCGGAGTAAATTAGCGG  7.949 16.058
## ACCTGACAGCGGAAACTT  9.039 13.047
## ACGGAAATCAGTGGTATT 20.959 15.073
preprocess can also be used to build a corpus using a specific list of genes:
## get list of genes of interest, for an example.
counts <- cleanCounts(counts = cd,
                      min.lib.size = 100,
                      min.reads = 1,
                      min.detected = 1)
圖片.png
odGenes <- getOverdispersedGenes(as.matrix(counts),
                      gam.k=5,
                      alpha=0.05,
                      plot=FALSE,
                      use.unadjusted.pvals=FALSE,
                      do.par=TRUE,
                      max.adjusted.variance=1e3,
                      min.adjusted.variance=1e-3,
                      verbose=FALSE, details=TRUE)
genes <- odGenes$ods
length(genes)
## build corpus using just the selected genes
mobCorpus2 <- preprocess(t(cd),
                       selected.genes = genes,
                       # can then proceed to filter this list, if desired
                       # min.reads = 1, 
                       min.lib.size = 1, # can still filter pixels
                       min.detected = 1, # can still filter to make sure the selected genes are present in at least 1 pixel
                       ODgenes = FALSE, # don't select the over dispersed genes
                       verbose = TRUE)
圖片.png

Selecting Optimal K

One limitation of LDA is that one must select the number of predicted cell-types (K) to be returned, a priori. Thus, one must either have knowledge of the number of cell-types present in the dataset of interest, or a way to select the model with the “optimal K”, or the model that best describes the dataset and captures the latent cell-types.
To do this, STdeconvolve fits a number of different LDA models with different K’s to the dataset and computes several different metrics to help guide users in the choice of K.
First, the perplexity of each fitted model is computed with respect to it’s K. This can be done using the entire real dataset the model was fitted to, or, users can chose a certain fraction of randomly selected pixels to be held out and used as a test set to compute model perplexity. The optimal K can either be the model with the K that returns the lowest perplexity (“min”), or we compute a “knee” metric (similar to choosing the number of principle components in PCA), which is the maximum second derivative, a reasonable choice for the inflection point (the elbow or knee).
Second, as K increases, the additional cell-types that are predicted tend to be represented present at small proportions in the pixels and thus contribute little to the predicted pixel cell-type profiles. To help put an upper limit on K, we also measure the number of predicted cell-types with mean pixel proportion less than 5%. After a certain K, the number of “rare” predicted cell-types, or those with low proportions across the pixels, steadily increases, suggesting that increasing K is no longer returning informative topics, or cell-types.
## fit LDA models to the corpus
ks <- seq(from = 2, to = 18, by = 1) # range of K's to fit LDA models with given the input corpus
ldas <- fitLDA(as.matrix(mobCorpus2$corpus),
               Ks = ks,
               ncores = parallel::detectCores(logical = TRUE) - 1, # number of cores to fit LDA models in parallel
               testSize = 0.2, # fraction of pixels to set aside for test corpus when computing perplexity
               plot=TRUE, verbose=FALSE)
圖片.png
While technically the lowest perplexity computed is when K=8, perplexity appears to be relatively stable between K=7 and K=18. Additionally, we expect there to be more than 4 cell-types and thus K should be greater than 4 (based on “knee” metric).
However, the number of cell-types with mean proportion <5% doesn’t start steadily increasing until around K=15, suggesting that the number of predicted cell-types are likely informative until this chosen K.
Once the LDA models are fitted, beta and theta matrices can be extracted for a given model. The simplest way to do this is with optimalModel() to get the specific model of interest:
## `optimalModel()` can take as arguments:
optimalModel(models = ldas, opt = "min") # "min" = K that resulted in minimum perplexity
## A LDA_VEM topic model with 8 topics.
optimalModel(models = ldas, opt = "kneed") # "kneed" = K that resulted in inflection perplexity
## A LDA_VEM topic model with 4 topics.
optimalModel(models = ldas, opt = 15) # or extract the model for any K that was used
## A LDA_VEM topic model with 15 topics.
Then, getBetaTheta() can be used to get the beta (cell-type gene expression profiles) and theta (pixel cell-type proportions) matrices.
results <- getBetaTheta(lda = optimalModel(models = ldas, opt = "15"),
                        corpus = mobCorpus2$corpus)
print(names(results))
## [1] "beta"  "theta"
或者落蝙, buildLDAobject() 是 getBetaTheta()织狐、clusterTopics() 和 combineTopics() 的包裝器。 具有相似基因表達(dá)譜的細(xì)胞類型被聚類筏勒,并提供了一組替代的“細(xì)胞類型cluster”移迫,其中細(xì)胞類型cluster是給定cluster內(nèi)細(xì)胞類型的聚合β和θ矩陣。
results <- buildLDAobject(LDAmodel = optimalModel(models = ldas, opt = "15"),
                          corpus = mobCorpus2$corpus,
                          deepSplit = 4,
                          colorScheme = "rainbow")
圖片.png
此處管行,結(jié)果是一個列表厨埋,其中包含單個預(yù)測細(xì)胞類型的 beta 和 theta 矩陣、細(xì)胞類型cluster的組合 beta 和 theta 矩陣捐顷。 “cluster”是指示每個預(yù)測細(xì)胞類型的指定cluster的因素荡陷,“樹狀”是聚集的細(xì)胞類型相對于它們預(yù)測的基因表達(dá)譜的樹狀圖。 “cols”和“clustCols”是指示細(xì)胞類型或cluster的顏色標(biāo)簽的因素迅涮,可選擇性地用于可視化目的废赞。

Clustering Cell-types

如上所述,預(yù)測的細(xì)胞類型可以基于它們預(yù)測的基因表達(dá)譜(β矩陣)被聚類為“細(xì)胞類型cluster”逗柴。 一組預(yù)測的細(xì)胞類型可能代表組織中較大細(xì)胞層的組成部分蛹头,因此將細(xì)胞類型聚集在一起以將該層可視化和評估為單個特征可能很有用。

如前面所示戏溺,最簡單的方法是使用 buildLDAobject(),它不僅返回單個細(xì)胞類型的 beta 和 theta 矩陣屠尊,還返回細(xì)胞類型cluster及其關(guān)聯(lián)的 beta 和 theta 矩陣旷祸。

然而,有了細(xì)胞類型的 beta 矩陣讼昆,可以通過直接調(diào)用 clusterTopics() 來對細(xì)胞類型進(jìn)行聚類托享,這允許指定執(zhí)行的聚類類型。 使用 R 包 dynamicTreeCut 使用動態(tài)樹拆分完成將細(xì)胞類型分配給特定cluster浸赫。
results <- getBetaTheta(lda = optimalModel(models = ldas, opt = "15"),
                        corpus = mobCorpus2$corpus)
clust <- clusterTopics(beta = results$beta,
                       clustering = "ward.D", # type of clustering
                       dynamic = "hybrid", # method to assign cell-types to clusters (see `dynamicTreeCut` options)
                       deepSplit = 4, # dynamic tree cutting sensitivity parameter
                       plot = TRUE)
圖片.png
To get the beta and theta matrices of the cell-type clusters, combineTopics() is used to aggregate the beta or theta matrices of the cell-types within assigned clusters:
betaCombn <- combineTopics(results$beta, clusters = clust$clusters, type = "b")
## [1] "cell-types combined."
thetaCombn <- combineTopics(results$theta, clusters = clust$clusters, type = "t")
## [1] "cell-types combined."
clust$clusters
##  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 
##  2  1  5  1  3  4  3  5  2  3  4  2  1  1  1 
## Levels: 1 2 3 4 5
dim(betaCombn)
## [1]   5 345
dim(thetaCombn)
## [1] 260   5

可視化

results <- buildLDAobject(LDAmodel = optimalModel(models = ldas, opt = "15"),
                          corpus = mobCorpus2$corpus,
                          deepSplit = 4,
                          colorScheme = "rainbow",
                          plot = FALSE)
m <- results$theta
p <- pos
plt <- vizAllTopics(theta = m,
             pos = p,
             topicOrder=seq(ncol(m)),
             topicCols=rainbow(ncol(m)),
             groups = NA,
             group_cols = NA,
             r = 0.4, # size of scatterpies; adjust depending on the coordinates of the pixels
             lwd = 0.1,
             showLegend = TRUE,
             plotTitle = "K=15")
plt <- plt + ggplot2::guides(fill=ggplot2::guide_legend(ncol=2))
plt
圖片.png
Scatterpie borders can be colored, either custom, or based on group membership
m <- results$theta
p <- pos
plt <- vizAllTopics(theta = m,
             pos = p,
             topicOrder=seq(ncol(m)),
             topicCols=rainbow(ncol(m)),
             groups = rep("0", dim(m)[1]),
             group_cols = c("0" = "black"),
             r = 0.4,
             lwd = 0.4, # adjust thickness of the scatterpie borders
             showLegend = TRUE,
             plotTitle = "K=15")
plt <- plt + ggplot2::guides(fill=ggplot2::guide_legend(ncol=2))
plt
圖片.png
Based on group membership (let’s use the coarse cell layers of the MOB)
m <- results$theta
p <- pos
plt <- vizAllTopics(theta = m,
             pos = p,
             topicOrder=seq(ncol(m)),
             topicCols=rainbow(ncol(m)),
             groups = annot, 
             group_cols = rainbow(length(levels(annot))),
             r = 0.4,
             lwd = 0.4, # adjust thickness of the scatterpie borders
             showLegend = TRUE,
             plotTitle = "K=15")
plt <- plt + ggplot2::guides(fill=ggplot2::guide_legend(ncol=2))
plt
圖片.png
Now let’s visualize the cell-type clusters:
m <- results$thetaCombn
p <- pos
plt <- vizAllTopics(theta = m,
             pos = p,
             topicOrder=seq(ncol(m)),
             topicCols=rainbow(ncol(m)),
             groups = rep("0", dim(m)[1]),
             group_cols = c("0" = "black"),
             r = 0.4,
             lwd = 0.2,
             showLegend = TRUE,
             plotTitle = "K=15 cell-type clusters")
# plt <- plt + ggplot2::guides(fill=guide_legend(ncol=2))
plt
圖片.png

Individual Cell-types and Cell-type clusters

m <- results$theta
p <- pos[rownames(results$theta),]
vizTopicClusters(theta = m,
                 pos = p,
                 clusters = results$cols,
                 sharedCol = TRUE, # cell-types in a cluster share the same color
                 groups = rep("0", dim(m)[1]),
                 group_cols = c("0" = "black"),
                 r = 0.4,
                 lwd = 0.3,
                 showLegend = TRUE)
圖片.png

圖片.png

圖片.png

圖片.png

圖片.png
m <- results$theta
p <- pos[rownames(results$theta),]
vizTopicClusters(theta = m,
                 pos = p,
                 clusters = results$cols,
                 sharedCol = FALSE, # cell-types in a cluster on a color gradient
                 groups = rep("0", dim(m)[1]),
                 group_cols = c("0" = "black"),
                 r = 0.4,
                 lwd = 0.3,
                 showLegend = TRUE)
圖片.png

圖片.png

圖片.png

圖片.png

圖片.png
m <- results$theta[,c("14", "12")]
p <- pos
other <- 1 - rowSums(m)
m <- cbind(m, other)
colnames(m) <- c("14", "12", "Other")
vizAllTopics(theta = m,
             pos = p,
             topicOrder=seq(ncol(m)),
             topicCols=c(transparentCol("red", percent = 50), # BONUS: can make colors transparent if overlaying on top of an image
                         "black",
                         "white"), 
             groups = rep("0", dim(m)[1]),
             group_cols = c("0" = "white"), # make scatterpie borders white to focus directly on the cell-type.
             r = 0.4,
             lwd = 0.1,
             showLegend = TRUE,
             overlay = NA) # BONUS: plot the scatterpies on top of a raster image of the H&E tissue, if set equal to the rgb matrix
圖片.png

非常好的方法闰围,很值得大家借鑒

生活很好,有你更好

最后編輯于
?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請聯(lián)系作者
禁止轉(zhuǎn)載既峡,如需轉(zhuǎn)載請通過簡信或評論聯(lián)系作者羡榴。
  • 序言:七十年代末,一起剝皮案震驚了整個濱河市运敢,隨后出現(xiàn)的幾起案子校仑,更是在濱河造成了極大的恐慌忠售,老刑警劉巖,帶你破解...
    沈念sama閱讀 206,013評論 6 481
  • 序言:濱河連續(xù)發(fā)生了三起死亡事件迄沫,死亡現(xiàn)場離奇詭異稻扬,居然都是意外死亡,警方通過查閱死者的電腦和手機(jī)羊瘩,發(fā)現(xiàn)死者居然都...
    沈念sama閱讀 88,205評論 2 382
  • 文/潘曉璐 我一進(jìn)店門泰佳,熙熙樓的掌柜王于貴愁眉苦臉地迎上來,“玉大人尘吗,你說我怎么就攤上這事逝她。” “怎么了摇予?”我有些...
    開封第一講書人閱讀 152,370評論 0 342
  • 文/不壞的土叔 我叫張陵汽绢,是天一觀的道長。 經(jīng)常有香客問我侧戴,道長宁昭,這世上最難降的妖魔是什么? 我笑而不...
    開封第一講書人閱讀 55,168評論 1 278
  • 正文 為了忘掉前任酗宋,我火速辦了婚禮积仗,結(jié)果婚禮上,老公的妹妹穿的比我還像新娘蜕猫。我一直安慰自己寂曹,他們只是感情好,可當(dāng)我...
    茶點故事閱讀 64,153評論 5 371
  • 文/花漫 我一把揭開白布回右。 她就那樣靜靜地躺著隆圆,像睡著了一般。 火紅的嫁衣襯著肌膚如雪翔烁。 梳的紋絲不亂的頭發(fā)上渺氧,一...
    開封第一講書人閱讀 48,954評論 1 283
  • 那天,我揣著相機(jī)與錄音蹬屹,去河邊找鬼侣背。 笑死,一個胖子當(dāng)著我的面吹牛慨默,可吹牛的內(nèi)容都是我干的贩耐。 我是一名探鬼主播,決...
    沈念sama閱讀 38,271評論 3 399
  • 文/蒼蘭香墨 我猛地睜開眼厦取,長吁一口氣:“原來是場噩夢啊……” “哼潮太!你這毒婦竟也來了?” 一聲冷哼從身側(cè)響起蒜胖,我...
    開封第一講書人閱讀 36,916評論 0 259
  • 序言:老撾萬榮一對情侶失蹤消别,失蹤者是張志新(化名)和其女友劉穎抛蚤,沒想到半個月后,有當(dāng)?shù)厝嗽跇淞掷锇l(fā)現(xiàn)了一具尸體寻狂,經(jīng)...
    沈念sama閱讀 43,382評論 1 300
  • 正文 獨居荒郊野嶺守林人離奇死亡岁经,尸身上長有42處帶血的膿包…… 初始之章·張勛 以下內(nèi)容為張勛視角 年9月15日...
    茶點故事閱讀 35,877評論 2 323
  • 正文 我和宋清朗相戀三年,在試婚紗的時候發(fā)現(xiàn)自己被綠了蛇券。 大學(xué)時的朋友給我發(fā)了我未婚夫和他白月光在一起吃飯的照片缀壤。...
    茶點故事閱讀 37,989評論 1 333
  • 序言:一個原本活蹦亂跳的男人離奇死亡,死狀恐怖纠亚,靈堂內(nèi)的尸體忽然破棺而出塘慕,到底是詐尸還是另有隱情,我是刑警寧澤蒂胞,帶...
    沈念sama閱讀 33,624評論 4 322
  • 正文 年R本政府宣布图呢,位于F島的核電站,受9級特大地震影響骗随,放射性物質(zhì)發(fā)生泄漏蛤织。R本人自食惡果不足惜,卻給世界環(huán)境...
    茶點故事閱讀 39,209評論 3 307
  • 文/蒙蒙 一鸿染、第九天 我趴在偏房一處隱蔽的房頂上張望指蚜。 院中可真熱鬧,春花似錦涨椒、人聲如沸摊鸡。這莊子的主人今日做“春日...
    開封第一講書人閱讀 30,199評論 0 19
  • 文/蒼蘭香墨 我抬頭看了看天上的太陽免猾。三九已至,卻和暖如春囤热,著一層夾襖步出監(jiān)牢的瞬間掸刊,已是汗流浹背。 一陣腳步聲響...
    開封第一講書人閱讀 31,418評論 1 260
  • 我被黑心中介騙來泰國打工赢乓, 沒想到剛下飛機(jī)就差點兒被人妖公主榨干…… 1. 我叫王不留,地道東北人石窑。 一個月前我還...
    沈念sama閱讀 45,401評論 2 352
  • 正文 我出身青樓牌芋,卻偏偏與公主長得像,于是被迫代替她去往敵國和親松逊。 傳聞我的和親對象是個殘疾皇子躺屁,可洞房花燭夜當(dāng)晚...
    茶點故事閱讀 42,700評論 2 345

推薦閱讀更多精彩內(nèi)容