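10X spatial transcriptomics and single-cell transcriptomics are both in full swing. Single-cell sequencing offers single-cell resolution for studying tissue, while spatial transcriptomics gives the actual location of each cell type within the tissue. Resolution and spatial position are of almost equal research value, so joint analysis of the two technologies combines complementary strengths, and it is also a challenge. Several joint-analysis methods already exist, including Seurat and scanpy, but so far they are rarely used. Today we look at a new joint-analysis method: cell2location. The paper is "Comprehensive mapping of tissue cell architecture via integrated single cell and spatial transcriptomics". Our task today is to understand this method thoroughly, and we start with the paper itself.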
Abstract
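The spatial location of cell types in tissue fundamentally shapes cellular interactions and function, but the high-throughput spatial mapping of complex tissues remains a challenge. We present cell2location, a principled and versatile Bayesian model that integrates single-cell and spatial transcriptomics to map cell types in situ in a comprehensive manner. (Note: a Bayesian model.) cell2location performs very well in terms of accuracy and comprehensiveness. In the mouse brain, we use a new paired single nucleus and spatial RNA-sequencing dataset to map dozens of cell types and identify tissue regions in an automated manner. We discover novel regional astrocyte subtypes including fine subpopulations in the thalamus and hypothalamus. (A new finding.) In the human lymph node, we resolve spatially interlaced immune cell states and identify co-located groups of cells underlying tissue organisation. (Cell co-localization.) We spatially map a rare pre-germinal centre B-cell population and predict putative cellular interactions relevant to the interferon response. In short, the method works well.
One thing to note here is the Bayesian model. This kind of model is very common in statistical modelling, so I will not go into detail. I recommend the book 《機器學習原理、算法與應用》 (Machine Learning: Principles, Algorithms and Applications), which covers many machine-learning algorithms and fundamentals and helps deepen our understanding of the algorithms behind bioinformatics analysis.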
Introduction
The cellular architecture of tissues, where distinct cell types are organized in space, underlies cell-cell communication, organ function and pathology. (Tissue is a complex, integrated whole.) Emerging spatial genomics technologies hold considerable promise for characterising tissue architecture, providing key opportunities to map resident cell types and cell signalling in situ, thereby helping guide in vitro tissue engineering efforts. (The main application value of spatial transcriptomics.) However, spatial transcriptomics still faces challenges. One reason is the enormous variation in tissue architecture across organs, ranging from the brain with hundreds of cell types found across discrete anatomical regions to immune organs with continuous cellular gradients and dynamically modified microenvironments. To create and map comprehensive tissue atlases, experimental and computational methods need to be aligned to cope with this variation and in particular, enable mapping numerous resident cell types across diverse and complex tissues in situ. (The technical challenge.)
Coupled single-cell and spatially resolved transcriptomics offer a scalable approach to address these challenges. (Single-cell and spatial transcriptomics are technically complementary.) The first step is to identify the cell types from dissociated tissue (single-cell transcriptomics), and the second is to match each cell type to its spatial distribution. The current challenges are as follows. First, spatial RNA-seq measurements (i.e. locations) combine multiple cell types as array-based mRNA capture currently does not match cellular boundaries in tissues. Thus, each spatial position corresponds to either several cell types (Visium, Tomo-Seq) or fractions of multiple cell types (Slide-Seq, HDST). Second, spatial RNA-seq measurements are confounded by different sources of variation as 1) cell numbers vary across tissue positions, 2) different cells and cell types differ in total mRNA content, and 3) thin tissue sectioning captures variable fractions of each cell's volume. Computational approaches need to appropriately model and account for all of these factors.
Here, we present cell2location, a principled and versatile Bayesian model for comprehensive mapping of cell types in spatial transcriptomic data. (This is our focus.) Cell2location uses reference gene expression signatures of cell types derived from scRNA-seq to decompose multi-cell spatial transcriptomic data into cell type abundance maps. (The basic principle is the same as in other methods; the algorithm differs.) The model accurately maps complex tissues, including rare cell types and fine subtypes, and it identifies tissue regions and co-located cell types downstream in an automated manner. (It can identify co-located cell types, which is important.) Two application cases follow, showing that the method works well.
Result
(1)Cell2location: a Bayesian model for spatial mapping of cell types
Cell2location maps the spatial distribution of cell types by integrating single-cell RNA-seq (scRNA-seq) and multi-cell spatial transcriptomic data from a given tissue.
As the schematic shows, the single-cell data serve as the reference against which the spatial positions of cell types are matched; that direction is fixed.
Step one: use a model to estimate the expression signatures of the cell types in the single-cell data, for example by identifying cell types and subpopulations with conventional clustering and then estimating the average gene expression profile of each cluster (see the figure below).
Step two: cell2location decomposes mRNA counts in spatial transcriptomic data using these reference signatures, thereby estimating the relative and absolute abundance of each cell type at each spatial location. (Decomposing the data.)
Cell2location is implemented as an interpretable hierarchical Bayesian model, thereby (1) providing principled means to account for model uncertainty, (2) accounting for linear dependencies in cell type abundances, (3) modelling differences in measurement sensitivity across technologies, and (4) accounting for unexplained/residual variation by employing a flexible count-based error model. Finally, (5) cell2location is computationally efficient, owing to variational approximate inference and GPU acceleration. (We will work through these methods in the next post.)
To validate cell2location, we initially used simulated data that reflects diverse cell abundance and spatial patterns. (The authors simulated spatial transcriptomics data.)
Note the Jensen-Shannon divergence (J-S divergence) here; the mathematics is explained below.
Briefly, we simulated a spatial transcriptomics dataset with 2,000 locations, based on reference cell-type annotations obtained from a mouse brain snRNA-seq reference dataset including 46 cell types. Multi-cell gene expression profiles at each location were derived by combining cells drawn from different reference cell types, using one of four cell abundance patterns with variable density and sparsity distribution that mimics the patterns observed in real data. cell2location was then run on these data to produce the results in the figure. The correlation is high overall, but there is a caveat: the simulated spatial data are built by merging the single-cell data. If real spatial transcriptomics data contain cell types that are absent from the single-cell data (for technical reasons, for example, 10X single-cell capture recovers neutrophils poorly), the predictions are likely to be wrong. Let's read on and see whether the authors address this.
Next, we compared cell2location to recently proposed alternative methods for the inference of relative cell-type abundance from spatial transcriptomics. As usual in papers of this kind, the authors' own software performs best. The model also produced more accurate estimates of relative cell-type abundance.
Note the PR curve here; these mathematical points are explained below.
Cell2location not only provides estimates of relative cell type fractions but additionally estimates absolute cell type abundance, which can be interpreted as the number of cells that express a reference cell type signature at a given location, which again were highly concordant with the simulated ground truth. (Estimating cell numbers is also important.)
In summary, these results support that cell2location can accurately estimate cell abundance across diverse cell types.
The paper then presents two examples in which the software is applied to joint-analysis problems. I will not go through the specific cases here; what we need more is the principle behind the algorithm.
Let's first deal with the J-S divergence and the PR curve.
Jensen-Shannon divergence (J-S divergence) is a method of measuring the similarity between two probability distributions. To understand it, we first need the KL divergence.
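KL divergence is also known as relative entropy, information divergence, or information gain. It is an asymmetric measure of the difference between two probability distributions P and Q. It measures the average number of extra bits needed to encode samples from P using a code based on Q. Typically, P represents the true distribution of the data, and Q represents a theoretical distribution, a model distribution, or an approximation of P.
For discrete distributions it is defined as follows:
$$D_{\mathrm{KL}}(P \,\|\, Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}$$
Because the negative logarithm is a convex function, Jensen's inequality guarantees that the KL divergence is non-negative.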
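The non-negativity claim can be spelled out in one line. The derivation below is the standard textbook argument, not something taken from the cell2location paper; it assumes discrete distributions, uses the definition $D_{\mathrm{KL}}(P \,\|\, Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)}$, and sums only over the values $x$ with $P(x) > 0$.

```latex
\begin{aligned}
-D_{\mathrm{KL}}(P \,\|\, Q)
  &= \sum_{x} P(x) \log \frac{Q(x)}{P(x)} \\
  &\le \log \sum_{x} P(x)\,\frac{Q(x)}{P(x)}
     && \text{(Jensen's inequality, $\log$ is concave)} \\
  &= \log \sum_{x} Q(x) \;\le\; \log 1 = 0 .
\end{aligned}
```

Hence $D_{\mathrm{KL}}(P \,\|\, Q) \ge 0$, with equality exactly when $P = Q$.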
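JS divergence (Jensen-Shannon)
JS divergence measures the similarity of two probability distributions. It is a variant built on KL divergence that fixes the asymmetry of KL divergence. JS divergence is symmetric, and its value lies between 0 and 1 (when base-2 logarithms are used). It is defined as follows:
$$D_{\mathrm{JS}}(P \,\|\, Q) = \tfrac{1}{2} D_{\mathrm{KL}}(P \,\|\, M) + \tfrac{1}{2} D_{\mathrm{KL}}(Q \,\|\, M), \qquad M = \tfrac{1}{2}(P + Q)$$
This is the quantity shown in panel B of the figure.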
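To make the two divergences concrete, here is a minimal sketch in plain Python. The function names `kl_divergence` and `js_divergence` and the toy distributions are my own illustration, not code from cell2location; the formulas are the standard discrete definitions with base-2 logarithms.

```python
import math

def kl_divergence(p, q, base=2.0):
    """KL(P || Q) for two discrete distributions given as equal-length lists.

    Terms with P(x) = 0 contribute nothing. Q(x) must be > 0 wherever
    P(x) > 0, otherwise the divergence is infinite (here: ZeroDivisionError).
    """
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0:
            total += pi * math.log(pi / qi, base)
    return total

def js_divergence(p, q, base=2.0):
    """JS(P || Q) = 0.5 * KL(P || M) + 0.5 * KL(Q || M) with M = (P + Q) / 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m, base) + 0.5 * kl_divergence(q, m, base)

# Toy distributions (hypothetical values, for illustration only).
p = [0.5, 0.5, 0.0]
q = [0.0, 0.5, 0.5]

print("JS(p, q) =", js_divergence(p, q))
print("JS(q, p) =", js_divergence(q, p))                      # same value: JS is symmetric
print("JS of disjoint distributions =", js_divergence([1.0, 0.0], [0.0, 1.0]))
```

Because M is positive wherever P or Q is, the JS divergence stays finite even when KL(P || Q) itself would be infinite, which is one practical reason to prefer it when comparing estimated and true cell-type proportions.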
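PR curve
ROC curves are better known than PR curves. For ROC curves, see my earlier post "A deeper look at what the R package AUCell does for single-cell analysis".
A PR curve is drawn with precision and recall as its two variables, with recall on the x-axis and precision on the y-axis.
So what is precision, and what is recall? Let's explain those first.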
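In a binary classification problem, the classifier labels each instance as yes or no, and the outcome can be summarized in a confusion matrix, as in the figure below. A positive instance correctly classified as positive is a TP (true positive); a positive instance wrongly classified as negative is an FN (false negative).
A negative instance correctly classified as negative is a TN (true negative); a negative instance wrongly classified as positive is an FP (false positive).
[An example: A is a cat (positive) and B is a hamster (negative). If A is classified as a cat, that is a TP; if A is classified as a hamster, that is an FN. If B is classified as a hamster, that is a TN; if B is classified as a cat, that is an FP.]
Precision and recall follow from the confusion matrix: precision = TP/(TP + FP), recall = TP/(TP + FN). (Note that the numerators are the same.) One key point to add:
Each point on a PR curve corresponds to one threshold. Choosing a threshold, say 50%, splits the samples: those with probability above 50% are treated as positive and those below 50% as negative, which gives one precision and recall pair. An example follows (the true column gives the actual class, positive or negative; the hyp column indicates whether the probability exceeds the threshold of 0.5):
From this table we get TP=6, FN=0, FP=2, TN=2. Hence recall = 6/(6+0) = 1 and precision = 6/(6+2) = 0.75, which gives the point (1, 0.75). Repeating this at different thresholds gives further points, from which the curve can be drawn.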
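The calculation above can be reproduced in a few lines of Python. The original table is not preserved here, so the labels and scores below are hypothetical values that I chose only so that they give the same counts (TP=6, FN=0, FP=2, TN=2) at a threshold of 0.5; the function names are likewise illustrative.

```python
def confusion_counts(y_true, scores, threshold=0.5):
    """Count TP, FP, FN, TN when a sample is called positive if score > threshold."""
    tp = fp = fn = tn = 0
    for label, score in zip(y_true, scores):
        predicted_positive = score > threshold
        if predicted_positive and label == 1:
            tp += 1
        elif predicted_positive and label == 0:
            fp += 1
        elif not predicted_positive and label == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def precision_recall(tp, fp, fn):
    """precision = TP / (TP + FP), recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if (tp + fp) else 1.0  # convention when nothing is called positive
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical data: 6 positives, 4 negatives.
y_true = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.75, 0.7, 0.65, 0.6, 0.85, 0.55, 0.4, 0.2]

tp, fp, fn, tn = confusion_counts(y_true, scores, threshold=0.5)
precision, recall = precision_recall(tp, fp, fn)
print(tp, fn, fp, tn)        # 6 0 2 2
print(recall, precision)     # 1.0 0.75
```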
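The PR curves are shown below. If one learner's P-R curve is completely enclosed by another learner's P-R curve, the latter can be said to outperform the former; for example, A and B above both outperform learner C. A and B cannot be ranked by inspection alone. We can compare the areas under the curves, but the break-even point or the F1 score is more commonly used. The break-even point (BEP) is the value at which P = R; a larger value indicates a better learner. F1 = 2 * P * R / (P + R); likewise, a larger F1 indicates a better learner.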
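As a rough sketch of how the curve, the break-even point and F1 relate, the snippet below sweeps every distinct score as a threshold and records precision, recall and F1 at each one. The data are hypothetical, the function name `pr_points` is my own, and a sample is called positive when its score is greater than or equal to the threshold (a convention chosen here for simplicity).

```python
def pr_points(y_true, scores):
    """Return a list of (threshold, precision, recall, f1), one per distinct score."""
    points = []
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        fn = sum(1 for y, s in zip(y_true, scores) if s < threshold and y == 1)
        # tp + fp >= 1 here, because the sample whose score equals the threshold is positive.
        precision = tp / (tp + fp)
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
        points.append((threshold, precision, recall, f1))
    return points

# Hypothetical labels and classifier scores.
y_true = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.75, 0.7, 0.65, 0.6, 0.85, 0.55, 0.4, 0.2]

points = pr_points(y_true, scores)
for threshold, precision, recall, f1 in points:
    print(f"thr={threshold:.2f}  P={precision:.3f}  R={recall:.3f}  F1={f1:.3f}")

best = max(points, key=lambda point: point[3])          # threshold with the highest F1
bep = [point for point in points if abs(point[1] - point[2]) < 1e-12]  # where P == R
print("best F1 at threshold", best[0])
print("break-even point(s):", bep)
```

Plotting recall against precision from `points` gives the PR curve. With few samples an exact P = R point may not exist, in which case the BEP is usually read off by interpolation.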
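Some of this material is adapted from: "A second pass through Zhou Zhihua's Machine Learning: PR curves and ROC curves"
"Understanding the P-R curve in depth"
Both kinds of curve are worth knowing, so that we are not caught out when we meet them later.
Next, let's look at the cell2location model.
A brief introduction to the model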
For a complete derivation of the cell2location model, please see supplementary computational methods. Briefly, cell2location is a Bayesian model, which estimates absolute cell density of cell types by decomposing mRNA counts d_{s,g} of each gene g = {1, ..., G} at locations s = {1, ..., S} into a set of predefined reference signatures of cell types g_{f,g}. For 10X Visium data, this matrix can be directly obtained from the 10X SpaceRanger software and imported into the data format used in a popular python package Scanpy. (Scanpy is used to read the 10X output; the analysis can also be combined with Seurat.) d_{s,g} should be filtered to a set of genes expressed in the single cell reference g_{f,g}. The point of this step is that, when the single-cell data are mapped onto the spatial data, both must contain the same expressed genes. The graphical model of cell2location is shown in the figure below:
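The gene-filtering step can be sketched as follows. This is a plain-Python illustration, not cell2location's own code: the helper `filter_to_shared_genes`, the gene names and the counts are all made up for the example.

```python
def filter_to_shared_genes(spatial_counts, spatial_genes, reference_genes):
    """Keep only the spatial count columns whose gene also appears in the reference.

    spatial_counts: list of rows (one per location), each a list of counts per gene.
    Returns the filtered matrix and the shared gene names, in spatial-matrix order.
    """
    reference_set = set(reference_genes)
    keep = [i for i, gene in enumerate(spatial_genes) if gene in reference_set]
    shared_genes = [spatial_genes[i] for i in keep]
    filtered = [[row[i] for i in keep] for row in spatial_counts]
    return filtered, shared_genes

# Hypothetical toy inputs: 2 locations x 4 genes, and a reference with 3 genes.
spatial_genes = ["Gfap", "Snap25", "Mbp", "Xist"]
spatial_counts = [
    [5, 12, 0, 1],
    [0, 30, 7, 2],
]
reference_genes = ["Mbp", "Gfap", "Aqp4"]

filtered_counts, shared_genes = filter_to_shared_genes(
    spatial_counts, spatial_genes, reference_genes
)
print(shared_genes)      # ['Gfap', 'Mbp']
print(filtered_counts)   # [[5, 0], [0, 7]]
```

The reference signature matrix would then be reordered to the same `shared_genes` order, so that column j means the same gene in both matrices. In an AnnData-based workflow the same idea is normally expressed by subsetting both objects on the intersection of their `var_names`.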
Let G = {g_{f,g}} denote an F x G matrix of reference cell type signatures, which consist of f = {1, ..., F} gene expression profiles g_{f,:} for g = {1, ..., G} genes, representing average expression of each gene in each cell type in linear mRNA counts space (not log-space). This matrix needs to be provided to cell2location and can be estimated from scRNA-seq profiles. As we can see, each cell type is represented by the mean expression of each gene across the cells of that type. Cell2location models the elements of D as Negative Binomial distributed. A brief word on the negative binomial distribution:
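The sketch below shows the two ideas in this paragraph on toy data: building a signature for each cell type as the mean of raw (linear, not log) counts, and combining signatures into the expected counts at one location. Both helper functions and all numbers are hypothetical. The second function is a deliberately simplified linear mixture, mu_g = sum over f of w_f * g_{f,g}; the actual cell2location model is richer (the paper's supplementary methods are the authority), so treat this only as intuition for what "decomposing into reference signatures" means.

```python
def mean_signatures(counts, cell_types):
    """Average raw counts per gene within each cell type.

    counts: list of per-cell count vectors (linear space, not log-transformed).
    cell_types: one label per cell. Returns {cell_type: mean expression vector}.
    """
    sums, sizes = {}, {}
    for row, cell_type in zip(counts, cell_types):
        if cell_type not in sums:
            sums[cell_type] = [0.0] * len(row)
            sizes[cell_type] = 0
        for j, value in enumerate(row):
            sums[cell_type][j] += value
        sizes[cell_type] += 1
    return {ct: [total / sizes[ct] for total in sums[ct]] for ct in sums}

def expected_location_counts(abundance, signatures):
    """Simplified mixture: mu_g = sum_f w_f * g_{f,g} (an assumption, not the full model)."""
    n_genes = len(next(iter(signatures.values())))
    mu = [0.0] * n_genes
    for cell_type, w in abundance.items():
        for j, g in enumerate(signatures[cell_type]):
            mu[j] += w * g
    return mu

# Hypothetical single-cell counts: 4 cells x 3 genes, two annotated cell types.
counts = [[10, 0, 2], [14, 0, 0], [0, 6, 1], [0, 4, 3]]
cell_types = ["astro", "astro", "neuron", "neuron"]

signatures = mean_signatures(counts, cell_types)
print(signatures)   # {'astro': [12.0, 0.0, 1.0], 'neuron': [0.0, 5.0, 2.0]}

# A location assumed to contain 2 astrocytes and 1 neuron.
mu = expected_location_counts({"astro": 2, "neuron": 1}, signatures)
print(mu)           # [24.0, 5.0, 4.0]
```

Averaging in linear count space matters because a weighted sum of signatures is only meaningful for counts; means of log-transformed values do not add up across cells in the same way.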
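The negative binomial distribution is a discrete probability distribution in statistics. It arises under the following conditions: the experiment consists of a series of independent trials; each trial has two outcomes, success or failure; the probability of success is constant; and the trials continue until r failures have occurred, where r is a positive integer. See the Baidu Baike entry on the negative binomial distribution for more. From here on the material requires a fairly deep mathematical background. I am not good at mathematics, though I have never been proud of that, so I hope someone strong in mathematics will share more on this part.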
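The definition above can be checked numerically with a short sketch. It uses the textbook form of the distribution (number of successes k before the r-th failure, success probability p), not the parameterization used inside cell2location; the function name `nb_pmf` and the values r = 3, p = 0.4 are arbitrary choices for illustration.

```python
import math

def nb_pmf(k, r, p):
    """P(X = k): probability of k successes before the r-th failure.

    P(X = k) = C(k + r - 1, k) * p**k * (1 - p)**r, computed in log space
    with lgamma so that it stays stable for large k.
    """
    log_coeff = math.lgamma(k + r) - math.lgamma(k + 1) - math.lgamma(r)
    return math.exp(log_coeff + k * math.log(p) + r * math.log(1 - p))

r, p = 3, 0.4
probs = [nb_pmf(k, r, p) for k in range(200)]   # the tail beyond k = 200 is negligible here

total = sum(probs)
mean = sum(k * pk for k, pk in enumerate(probs))
variance = sum((k - mean) ** 2 * pk for k, pk in enumerate(probs))

print("sum of probabilities:", total)   # close to 1
print("mean:", mean)                    # close to r * p / (1 - p) = 2
print("variance:", variance)            # close to r * p / (1 - p)**2, larger than the mean
```

The variance always exceeds the mean, unlike in a Poisson distribution. That extra spread (overdispersion) is the usual reason sequencing count data are modelled as negative binomial.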
Finally, here are the analysis results:
They look quite good. Give it a try.