Joint analysis of 10X single-cell and spatial data: cell2location

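10X spatial transcriptomics and single-cell transcriptomics are both in full swing. Single-cell sequencing lets us study a tissue at the resolution of individual cells, while spatial transcriptomics tells us where cell types sit within the tissue. Resolution and spatial position are of almost equal research value, so combining the two technologies is a natural way to play to the strengths of each, and also a challenge. Several joint-analysis methods already exist, including those in Seurat and scanpy, but so far they are not widely used. Today I want to share a newer joint-analysis method, cell2location. The paper is "Comprehensive mapping of tissue cell architecture via integrated single cell and spatial transcriptomics". Our task today is to work through this method, starting with the paper itself.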

Abstract

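The spatial position of cell types within a tissue fundamentally shapes cell-cell interactions and function, but the high-throughput spatial mapping of complex tissues remains a challenge. We present cell2location, a principled and versatile Bayesian model that integrates single-cell and spatial transcriptomics to map cell types in situ in a comprehensive manner. cell2location performs well in terms of both accuracy and comprehensiveness. In the mouse brain, we use a new paired single nucleus and spatial RNA-sequencing dataset to map dozens of cell types and identify tissue regions in an automated manner. We discover novel regional astrocyte subtypes including fine subpopulations in the thalamus and hypothalamus (a new finding). In the human lymph node, we resolve spatially interlaced immune cell states and identify co-located groups of cells underlying tissue organisation (cell co-localization). We spatially map a rare pre-germinal centre B-cell population and predict putative cellular interactions associated with the interferon response. In short, the method works well.
One thing to note here is the Bayesian model. Such models are very common in statistical modelling, so I won't go into detail. I recommend the book "Machine Learning: Principles, Algorithms and Applications" (《机器学习原理、算法与应用》), which covers many machine-learning algorithms and fundamentals and helps deepen our understanding of the algorithms behind bioinformatics analyses.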

Introduction

The cellular architecture of tissues, where distinct cell types are organized in space, underlies cell-cell communication, organ function and pathology (a tissue is a complex, integrated whole). Emerging spatial genomics technologies hold considerable promise for characterising tissue architecture, providing key opportunities to map resident cell types and cell signalling in situ, thereby helping guide in vitro tissue engineering efforts (the main application value of spatial transcriptomics). However, spatial transcriptomics still faces challenges. One reason is the enormous variation in tissue architecture across organs, ranging from the brain with hundreds of cell types found across discrete anatomical regions to immune organs with continuous cellular gradients and dynamically modified microenvironments. To create and map comprehensive tissue atlases, experimental and computational methods need to be aligned to cope with this variation and in particular, enable mapping numerous resident cell types across diverse and complex tissues in situ (the technical challenge).
Coupled single-cell and spatially resolved transcriptomics offer a scalable approach to address these challenges (the two technologies complement each other). The first step is to identify the cell types from dissociated tissue (single-cell transcriptomics), and the next is to map each cell type to its spatial distribution. The current challenges are as follows. First, spatial RNA-seq measurements (i.e. locations) combine multiple cell types as array-based mRNA capture currently does not match cellular boundaries in tissues. Thus, each spatial position corresponds to either several cell types (Visium, Tomo-Seq) or fractions of multiple cell types (Slide-Seq, HDST). Second, spatial RNA-seq measurements are confounded by different sources of variation as 1) cell numbers vary across tissue positions, 2) different cells and cell types differ in total mRNA content, and 3) thin tissue sectioning captures variable fractions of each cell's volume. Computational approaches need to appropriately model and account for all of these factors.
Here, we present cell2location, a principled and versatile Bayesian model for comprehensive mapping of cell types in spatial transcriptomic data (the focus of this post). Cell2location uses reference gene expression signatures of cell types derived from scRNA-seq to decompose multi-cell spatial transcriptomic data into cell type abundance maps (the basic idea is the same as in other methods; the algorithm differs). The model accurately maps complex tissues, including rare cell types and fine subtypes, and it identifies tissue regions and co-located cell types downstream in an automated manner (being able to identify co-located cell types is important). Two application cases follow, demonstrating that the method works well.

Result

（1）Cell2location: a Bayesian model for spatial mapping of cell types

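Cell2location maps the spatial distribution of cell types by integrating single-cell RNA-seq (scRNA-seq) and multi-cell spatial transcriptomic data from a given tissue.

[Figure: schematic of the cell2location workflow]

From the schematic, the single-cell data serve as the reference against which the spatial positions of cell types are matched; that general direction does not change.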
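Step 1: the model estimates the expression signatures of cell types from the single-cell data. For example, these can be obtained by using conventional clustering to identify cell types and subpopulations, and then estimating the average gene expression profile of each cluster (see the figure below).

[Figure: estimating reference cell type signatures]

We need to go through this step by step. Cell2location implements this estimation step with negative binomial regression, which allows data to be combined robustly across technologies and batches (math again).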
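To make the "average expression profile per cluster" idea concrete, here is a minimal sketch. It shows only the simple averaging route mentioned above, not the negative binomial regression that cell2location itself uses; the function name `mean_signatures`, the toy counts and the cluster labels are all made up for illustration.

```python
from collections import defaultdict


def mean_signatures(counts, labels):
    """Average raw counts per cluster.

    counts: list of cells, each a list of per-gene counts (linear space, not log).
    labels: cluster label of each cell.
    Returns {cluster: [mean count per gene]}.
    """
    sums = {}
    n_cells = defaultdict(int)
    for row, label in zip(counts, labels):
        if label not in sums:
            sums[label] = [0.0] * len(row)
        sums[label] = [s + x for s, x in zip(sums[label], row)]
        n_cells[label] += 1
    return {label: [s / n_cells[label] for s in total] for label, total in sums.items()}


# Toy data: 4 cells x 3 genes, two hypothetical clusters.
counts = [[4, 0, 1], [6, 0, 3], [0, 5, 1], [0, 7, 1]]
labels = ["astrocyte", "astrocyte", "neuron", "neuron"]
signatures = mean_signatures(counts, labels)
```

Each value of `signatures` is one row of the reference matrix that the spatial data are later decomposed against.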
Step 2: cell2location decomposes mRNA counts in spatial transcriptomic data using these reference signatures, thereby estimating the relative and absolute abundance of each cell type at each spatial location (decomposing the data).
Cell2location is implemented as an interpretable hierarchical Bayesian model, thereby (1) providing principled means to account for model uncertainty, (2) accounting for linear dependencies in cell type abundances, (3) modelling differences in measurement sensitivity across technologies, and (4) accounting for unexplained/residual variation by employing a flexible count-based error model. Finally, (5) cell2location is computationally efficient, owing to variational approximate inference and GPU acceleration (these methods will be analyzed in the next post).
To validate cell2location, we initially used simulated data that reflects diverse cell abundance and spatial patterns (the authors simulated spatial transcriptomics data).

[Figure: validation of cell2location on simulated data]

Note the Jensen-Shannon divergence (J-S divergence) here; the math is explained below.
Briefly, we simulated a spatial transcriptomics dataset with 2,000 locations, based on reference cell-type annotations obtained from a mouse brain snRNA-seq reference dataset including 46 cell types. Multi-cell gene expression profiles at each location were derived by combining cells drawn from different reference cell types, using one of four cell abundance patterns with variable density and sparsity distribution that mimics the patterns observed in real data. cell2location was then run on the simulated data, giving the results in the figure. The correlation with the ground truth is very high overall, but there is one concern: the simulated spatial data are built by merging single-cell data. If real spatial data contain cell types that are absent from the single-cell reference (for example because of technical limitations; 10X single-cell captures neutrophils poorly), the predictions may well be wrong. Let's read on and see whether the authors address this.
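The simulation idea (building one multi-cell location by combining cells drawn from reference cell types) can be sketched as follows. This is a toy illustration under my own assumptions, not the authors' simulation code: the helper `simulate_location`, the tiny reference and the cell numbers are hypothetical, and the paper's four abundance patterns are not reproduced.

```python
import random


def simulate_location(cells_by_type, n_cells_per_type, rng):
    """Sum the counts of randomly drawn reference cells to form one synthetic location."""
    n_genes = len(next(iter(cells_by_type.values()))[0])
    profile = [0] * n_genes
    for cell_type, n_cells in n_cells_per_type.items():
        for _ in range(n_cells):
            cell = rng.choice(cells_by_type[cell_type])  # draw one reference cell
            profile = [p + c for p, c in zip(profile, cell)]
    return profile


# Hypothetical reference: two cell types, two cells each, three genes.
cells_by_type = {
    "astrocyte": [[4, 0, 1], [6, 0, 3]],
    "neuron": [[0, 5, 1], [0, 7, 1]],
}
rng = random.Random(0)
truth = {"astrocyte": 2, "neuron": 1}  # known ground-truth cell numbers
spot = simulate_location(cells_by_type, truth, rng)
```

Because `truth` is known, any estimate of cell abundance for `spot` can be scored against it. The concern raised above also shows up directly: every simulated location can only ever contain cell types that exist in `cells_by_type`.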
Next, we compared cell2location to recently proposed alternative methods for the inference of relative cell-type abundance from spatial transcriptomics. As in most papers, the authors' own software performs best. The model also yields more accurate estimates of relative cell-type abundance.

[Figure: comparison of cell2location with alternative methods]

Note the PR curve here; the math behind it is explained below.
cell2location not only provides estimates of relative cell type fractions but additionally estimates absolute cell type abundance, which can be interpreted as the number of cells that express a reference cell type signature at a given location, which again were highly concordant with the simulated ground truth (estimating cell numbers is also important).
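The link between absolute abundance and relative fractions is simple arithmetic, sketched below. The function `to_fractions` and the numbers are hypothetical; this is not part of the cell2location API.

```python
def to_fractions(abundance):
    """Convert absolute cell abundances at one location into relative fractions."""
    total = sum(abundance.values())
    if total == 0:
        # Empty location: no cells, so every fraction is reported as 0.
        return {cell_type: 0.0 for cell_type in abundance}
    return {cell_type: value / total for cell_type, value in abundance.items()}


# Hypothetical absolute abundances (roughly "number of cells") at one location.
location_abundance = {"astrocyte": 3.0, "neuron": 1.0, "microglia": 0.0}
fractions = to_fractions(location_abundance)
```

Going from absolute to relative loses information: a location with 3 astrocytes and 1 neuron gives the same fractions as one with 30 and 10, which is why the absolute estimate matters.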

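[Figure: estimated versus simulated absolute cell abundance]

In short, these results support that cell2location can accurately estimate cell abundance across diverse cell types.
The paper then uses two examples to show the software applied to the joint-analysis problem. I won't go into the case studies here; what we need more is the principle of the algorithm.

Let's first deal with the J-S divergence and the PR curve.

Jensen-Shannon divergence (J-S divergence) is a method of measuring the similarity between two probability distributions. To understand it, we first need the KL divergence.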

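KL divergence is also called relative entropy, information divergence, or information gain. It is an asymmetric measure of the difference between two probability distributions P and Q: it measures the average number of extra bits needed to encode samples from P using a code based on Q. Typically, P represents the true distribution of the data, and Q represents a theoretical distribution, a model distribution, or an approximation of P.
It is defined as follows: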

$$D_{KL}(P \,\|\, Q) = \sum_{i} P(i)\,\log\frac{P(i)}{Q(i)}$$

Because the logarithm is concave, Jensen's inequality guarantees that the KL divergence is non-negative, and it equals zero only when P = Q.
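A small numerical check of the two properties just mentioned (asymmetry and non-negativity). This is a sketch for discrete distributions only; the function name `kl_divergence` and the example distributions are mine, and base-2 logarithms are assumed so the result is in bits.

```python
import math


def kl_divergence(p, q, base=2.0):
    """D_KL(P || Q) for two discrete distributions given as equal-length lists."""
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0:
            continue  # the term 0 * log(0 / q) is taken to be 0
        if q_i == 0:
            return math.inf  # P puts mass where Q has none
        total += p_i * math.log(p_i / q_i, base)
    return total


p = [0.5, 0.5]
q = [0.9, 0.1]
forward = kl_divergence(p, q)   # cost of coding samples from P with a code built for Q
backward = kl_divergence(q, p)  # the other direction gives a different number
```

`forward` and `backward` are both positive but not equal, which is exactly the asymmetry that the J-S divergence is designed to remove.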

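JS divergence (Jensen-Shannon)
The JS divergence measures the similarity of two probability distributions. It is a variant of the KL divergence that fixes its asymmetry: the JS divergence is symmetric, and with base-2 logarithms its value lies between 0 and 1. It is defined as follows: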

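$$JS(P \,\|\, Q) = \tfrac{1}{2} D_{KL}(P \,\|\, M) + \tfrac{1}{2} D_{KL}(Q \,\|\, M), \qquad M = \tfrac{1}{2}(P + Q)$$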

This is the metric reported in panel B of the simulation figure.
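The same kind of check for the J-S divergence, here applied to a made-up pair of "true" and "estimated" cell type fractions at one location. The function names and numbers are illustrative assumptions; base-2 logarithms are used so the value is bounded by 1.

```python
import math


def _kl_bits(p, q):
    """D_KL(P || Q) in bits; assumes q_i > 0 wherever p_i > 0."""
    return sum(p_i * math.log2(p_i / q_i) for p_i, q_i in zip(p, q) if p_i > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: average KL divergence of P and Q to their mixture M."""
    m = [(p_i + q_i) / 2 for p_i, q_i in zip(p, q)]
    return 0.5 * _kl_bits(p, m) + 0.5 * _kl_bits(q, m)


true_fractions = [0.7, 0.2, 0.1]       # hypothetical ground truth at one location
estimated_fractions = [0.6, 0.3, 0.1]  # hypothetical model estimate
score = js_divergence(true_fractions, estimated_fractions)

# Two distributions with no overlap reach the upper bound of 1 bit.
disjoint = js_divergence([1.0, 0.0], [0.0, 1.0])
```

A smaller `score` means the estimated cell type composition is closer to the truth, so lower is better when methods are compared with this metric.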

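PR curve

Compared with the PR curve, the ROC curve is better known; see my earlier post on ROC curves, "An in-depth look at what the R package AUCell does for single-cell analysis".
As for the PR curve:

A PR curve plots precision against recall, with recall on the x-axis and precision on the y-axis.
So what is precision, and what is recall? Let me explain first.
In a binary classification problem, the classifier labels each instance as positive or negative, and the outcome can be summarized in a confusion matrix, as shown below.

[Figure: confusion matrix for binary classification]

Note: a positive example correctly classified as positive is a TP (true positive); a positive example wrongly classified as negative is an FN (false negative).
A negative example correctly classified as negative is a TN (true negative); a negative example wrongly classified as positive is an FP (false positive).
[Example: A is a cat (positive) and B is a hamster (negative). If A is classified as a cat, that is a TP; if A is classified as a hamster, that is an FN. If B is classified as a hamster, that is a TN; if B is classified as a cat, that is an FP.]
Precision and recall follow from the confusion matrix: precision = TP/(TP + FP), recall = TP/(TP + FN) (note that the numerators are the same).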
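The two formulas translate directly into code. This is a minimal sketch; the function name is mine, and returning 0.0 when a denominator is zero is a convention I chose for the example, not something the definitions dictate.

```python
def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # no predicted positives: assumed 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0     # no actual positives: assumed 0.0
    return precision, recall


# Hypothetical counts: 8 cats found, 2 hamsters mistaken for cats, 8 cats missed.
precision, recall = precision_recall(tp=8, fp=2, fn=8)
```

Here precision is 0.8 (most predicted cats really are cats) while recall is only 0.5 (half of the cats were missed), which shows why the two numbers have to be read together.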

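One more important point:
Each point on a PR curve corresponds to one threshold. Choosing a threshold, say 50%, splits the samples: those with a predicted probability above 50% are called positive and those below 50% negative, and precision and recall are computed from that split.

Here is an example (the true column gives the actual class, positive or negative; the hyp column indicates whether the predicted probability exceeds 0.5 at a threshold of 0.5):

[Table: true labels and hyp (prediction at threshold 0.5) for the example samples]

From this table we get TP=6, FN=0, FP=2, TN=2. So recall = 6/(6+0) = 1 and precision = 6/(6+2) = 0.75, which gives the point (1, 0.75). Computing the points for other thresholds in the same way lets us draw the curve.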
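The thresholding step can be sketched as below. The original example table is not available, so the labels and scores here are made up, chosen only so that a threshold of 0.5 reproduces the counts TP=6, FN=0, FP=2, TN=2; the function name is also mine.

```python
def confusion_counts(y_true, scores, threshold):
    """Count TP, FP, FN, TN when scores above the threshold are called positive."""
    tp = fp = fn = tn = 0
    for truth, score in zip(y_true, scores):
        predicted_positive = score > threshold
        if truth and predicted_positive:
            tp += 1
        elif truth:
            fn += 1
        elif predicted_positive:
            fp += 1
        else:
            tn += 1
    return tp, fp, fn, tn


# Hypothetical data: 6 positives followed by 4 negatives.
y_true = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.95, 0.90, 0.85, 0.80, 0.70, 0.60, 0.65, 0.55, 0.40, 0.20]

# One (recall, precision) point per threshold.
pr_points = []
for threshold in (0.5, 0.75):
    tp, fp, fn, tn = confusion_counts(y_true, scores, threshold)
    pr_points.append((tp / (tp + fn), tp / (tp + fp)))
```

Raising the threshold from 0.5 to 0.75 removes both false positives but also loses two true positives, so precision goes up while recall goes down. That trade-off is what the curve traces out.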
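The PR curve looks like this:

[Figure: P-R curves of learners A, B and C]

If one learner's P-R curve is completely enclosed by another learner's P-R curve, the latter performs better; for example, A and B above both outperform learner C. A and B cannot be ranked directly. We can compare them by the area under the curve, but the break-even point or the F1 score is more commonly used. The break-even point (BEP) is the value at which P = R; a larger value indicates a better learner. Likewise, F1 = 2 * P * R / (P + R), and a larger F1 indicates a better learner.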
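Both summary numbers are easy to compute from (recall, precision) points. A minimal sketch: the function names and the example points are hypothetical, and since a real curve is only known at a finite set of thresholds, the break-even point is approximated here by the point where precision and recall are closest.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall: F1 = 2 * P * R / (P + R)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def break_even_point(points):
    """Return the (recall, precision) point at which |P - R| is smallest."""
    return min(points, key=lambda point: abs(point[0] - point[1]))


# Hypothetical (recall, precision) points of one learner.
points = [(1.0, 0.6), (0.8, 0.8), (0.5, 1.0)]
bep = break_even_point(points)
f1_values = [f1_score(precision, recall) for recall, precision in points]
```

For these points the break-even value is 0.8, and the same point also has the highest F1, so the two summaries agree here.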
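Partly based on: "Second pass through Zhou Zhihua's Machine Learning (《机器学习》): PR curves and ROC curves"
 "Understanding the P-R curve in depth"
Both kinds of curve are worth knowing, so that we are not caught out when we run into them later.

Next, let's look at the cell2location model.

A brief introduction to the model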
For a complete derivation of the cell2location model, please see supplementary computational methods. Briefly, cell2location is a Bayesian model, which estimates absolute cell density of cell types by decomposing mRNA counts d_{s,g} of each gene g = {1, ..., G} at locations s = {1, ..., S} into a set of predefined reference signatures of cell types g_{f,g}. For 10X Visium data, the count matrix D = {d_{s,g}} can be directly obtained from the 10X SpaceRanger software and imported into the data format used in the popular Python package Scanpy (Scanpy can read the 10X output, and the analysis can also be combined with Seurat). d_{s,g} should be filtered to a set of genes expressed in the single cell reference g_{f,g}. The point of this step is that, when mapping between single-cell and spatial data, both must use the same set of expressed genes. The graphical model of cell2location is shown below:
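The gene-filtering step can be sketched in plain Python. This is an illustration of the idea only, not the cell2location or Scanpy API; the function name, gene names and counts are made up.

```python
def filter_to_shared_genes(spatial_genes, spatial_counts, reference_genes):
    """Keep only the columns of the spatial count matrix whose gene is in the reference."""
    reference_set = set(reference_genes)
    keep = [i for i, gene in enumerate(spatial_genes) if gene in reference_set]
    kept_genes = [spatial_genes[i] for i in keep]
    kept_counts = [[row[i] for i in keep] for row in spatial_counts]
    return kept_genes, kept_counts


# Hypothetical inputs: 2 locations x 4 genes, and a reference that expresses 3 genes.
spatial_genes = ["Gfap", "Snap25", "Xist", "Mbp"]
spatial_counts = [[10, 6, 1, 4], [0, 18, 2, 7]]
reference_genes = ["Snap25", "Gfap", "Aqp4"]

shared_genes, shared_counts = filter_to_shared_genes(
    spatial_genes, spatial_counts, reference_genes
)
```

After filtering, the reference signatures also need to be put in the same gene order as `shared_genes`, so that column g means the same gene in both matrices.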

[Figure: graphical model of cell2location]

Let G = {g_{f,g}} denote an F × G matrix of reference cell type signatures, which consists of f = {1, ..., F} gene expression profiles g_{f,:} for g = {1, ..., G} genes, representing average expression of each gene in each cell type in linear mRNA counts space (not log-space). This matrix needs to be provided to cell2location and can be estimated from scRNA-seq profiles. Here we can see that each cell type is represented by the average expression of its genes. Cell2location models the elements of D as Negative Binomial distributed. A quick note on the negative binomial distribution:
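To see what "decomposing counts into reference signatures" means, here is a deliberately simplified sketch in which the expected count at a location is just an abundance-weighted sum of the signatures. This is my simplification for illustration: the abundance matrix name, the function and the numbers are assumptions, and the sensitivity and residual-variation terms mentioned earlier are left out, so this is not the full cell2location model.

```python
def expected_counts(abundance, signatures):
    """mu[s][g] = sum over cell types f of abundance[s][f] * signatures[f][g]."""
    n_genes = len(signatures[0])
    return [
        [sum(w_f * sig[g] for w_f, sig in zip(w_s, signatures)) for g in range(n_genes)]
        for w_s in abundance
    ]


# Hypothetical reference signatures: F = 2 cell types x G = 3 genes (linear counts).
signatures = [[5.0, 0.0, 2.0], [0.0, 6.0, 1.0]]
# Hypothetical abundances: S = 2 locations x F = 2 cell types.
abundance = [[2.0, 1.0], [0.0, 3.0]]

mu = expected_counts(abundance, signatures)
```

In the model, the observed count d_{s,g} is then treated as a Negative Binomial draw around such an expected value, and inference runs in the opposite direction: given D and the signatures, estimate the abundances.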
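The negative binomial distribution is a discrete probability distribution. It arises when an experiment consists of a series of independent trials, each with two outcomes (success or failure) and a constant probability of success, and the trials continue until r failures have occurred, where r is a positive integer; the number of successes observed then follows a negative binomial distribution. See the Baidu Baike entry on the negative binomial distribution. From here on the material requires fairly deep mathematical background. Math is not my strength, though I have never been proud of that, so I hope someone strong in math will step in and share more.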
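Under the definition above (count the successes seen before the r-th failure), the probability mass function can be written down and checked numerically. This is a sketch of the textbook form only; cell2location parameterizes the distribution differently (by mean and overdispersion), which is not shown here, and the function name and parameter values are mine.

```python
import math


def neg_binomial_pmf(k, r, p):
    """P(X = k): probability of k successes before the r-th failure.

    p is the per-trial success probability and r a positive integer.
    """
    return math.comb(k + r - 1, k) * (p ** k) * ((1 - p) ** r)


r, p = 3, 0.4
probabilities = [neg_binomial_pmf(k, r, p) for k in range(200)]
total = sum(probabilities)                                # should be very close to 1
mean = sum(k * pk for k, pk in enumerate(probabilities))  # close to r * p / (1 - p)
```

With these values the mean is 2 and the variance, r * p / (1 - p)^2, is about 3.33. The variance exceeding the mean is the property that makes this distribution a common choice for overdispersed sequencing counts.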
Finally, here are the analysis results:

[Figure: cell2location analysis results]

They look quite good. Give it a try.
