Hi everyone. Today let's put in some effort and look at the dropout phenomenon that pervades 10X single-cell and 10X spatial transcriptomics data, how it affects our analyses, and how the method in this paper, scDCC, works around it. The paper is "Model-based deep embedding for constrained clustering analysis of single cell RNA-seq data", published in Nature Communications in March 2021 by a Chinese group; not a bad venue at all. For background on dropout, see my earlier post 深度學(xué)習(xí)中Dropout原理解析(10X單細(xì)胞和10X空間轉(zhuǎn)錄組) for a quick primer. Now let's read the paper in depth and see how it tackles the problem. (When will we finally write our own algorithms, instead of just reading and borrowing other people's?)
As usual: first the paper, then example code.
Abstract
Clustering is a critical step in single cell-based studies. Most existing methods support unsupervised clustering without the a priori exploitation of any domain knowledge. (I suspect that's how most of us work: once we have the matrix, we go straight into Seurat for dimension reduction and clustering, with hardly any prior knowledge involved.) When confronted by the high dimensionality and pervasive dropout events of scRNA-seq data, purely unsupervised clustering methods may not produce biologically interpretable clusters (ever seen this? some clusters have few or even no differential genes and simply cannot be annotated), which complicates cell type assignment. In such cases, the only recourse is for the user to manually and repeatedly tweak clustering parameters until acceptable clusters are found. (Yes, blind parameter tuning; perhaps this is also the biggest gulf between research services and clinical testing.) Consequently, the path to obtaining biologically meaningful clusters can be ad hoc and laborious. Here we report a principled clustering method named scDCC that integrates domain knowledge into the clustering step (using prior knowledge in clustering, which is in fact hard to do well). Experiments on various scRNA-seq datasets from thousands to tens of thousands of cells show that scDCC can significantly improve clustering performance, facilitating the interpretability of clusters and downstream analyses, such as cell type assignment. (The last sentence is boilerplate; every paper praises itself, or it wouldn't get published.)
Introduction
Let me distill this part.
The commonly used pipeline pairs dimension reduction (PCA, t-SNE, UMAP) with k-means or hierarchical clustering for visualization, including SC3 (spectral clustering), pcaReduce (PCA + k-means + hierarchical), TSCAN (PCA + Gaussian mixture model) and mpath (hierarchical), to name a few (there really are many; I have covered the principles of PCA in depth before, so feel free to look that up). Because scRNA-seq data are sparse (meaning dropout plus high gene-level variability), these traditional clustering methods tend to yield suboptimal results.
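To ground the discussion, here is a minimal sketch of that conventional pipeline (normalize, PCA, k-means) using scanpy and scikit-learn; the input file name and the choice of k are hypothetical placeholders, not from the paper.

```python
import scanpy as sc
from sklearn.cluster import KMeans

# Hypothetical input; any AnnData of raw counts works the same way.
adata = sc.read_h5ad("pbmc_counts.h5ad")

# Standard preprocessing: library-size normalization, log transform, PCA.
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.pca(adata, n_comps=50)

# Cluster in PCA space with k-means; k must be chosen by the user,
# which is exactly the kind of manual tweaking the paper complains about.
kmeans = KMeans(n_clusters=8, n_init=10, random_state=0)
adata.obs["kmeans"] = kmeans.fit_predict(adata.obsm["X_pca"]).astype(str)
```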
Recently, various clustering methods have been proposed to overcome the challenges in scRNA-seq data analysis.
(1) Shared nearest neighbor (SNN)-Cliq combines a quasi-clique-based clustering algorithm with the SNN-based similarity measure to automatically identify clusters in high-dimensional and highly variable scRNA-seq data (SNN is the approach Seurat's clustering uses).
(2) DendroSplit applies "split" and "merge" operations to the dendrogram obtained from hierarchical clustering, which iteratively groups cells by their pairwise distances (computed on selected genes), to reveal multiple levels of biologically meaningful populations with interpretable hyperparameters (hierarchical clustering is actually rarely used in practice).
(3) If the dropout probability P(u) is a decreasing function of the gene expression u, CIDR uses a nonlinear least-squares regression to empirically estimate P(u) and imputes the gene expressions with a weighted average to alleviate the impact of dropouts (personally I am not convinced this is reliable; see the curve-fitting sketch after this list).
(4) Clustering analysis is performed on the first few principal coordinates, obtained through principal coordinate analysis (PCoA) on the imputed expression matrix (this is what everyone does anyway; the only real difference is how many components are kept).
(5) SIMLR and MPSSC are both multiple kernel-based spectral clustering methods. Considering the complexities of scRNA-seq data, multiple kernel functions can help to learn robust similarity measures that correspond to different informative representations of the data (frankly, I had never heard of these methods). However, spectral clustering relies on the full graph Laplacian matrix, which is prohibitively expensive to compute and store (a conspicuous drawback; no wonder they are not widely used).
(6) The high complexity and limited scalability generally impede applying these methods to large scRNA-seq datasets (the sheer scale of single-cell data really does rule out the older generation of methods).
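To make item (3) concrete, here is a toy sketch, not CIDR's actual code, of empirically fitting a decreasing dropout-probability curve P(u) by nonlinear least squares; the logistic form and all names are my own assumptions.

```python
import numpy as np
from scipy.optimize import curve_fit

def dropout_prob(u, a, b):
    """Assumed form: a logistic curve decreasing in expression u."""
    return 1.0 / (1.0 + np.exp(a * (u - b)))

# Toy data: mean log-expression per gene vs. observed zero fraction.
rng = np.random.default_rng(0)
u = rng.uniform(0, 8, 500)
p_obs = dropout_prob(u, 1.2, 2.0) + rng.normal(0, 0.03, 500)

# Empirically estimate P(u), the step CIDR performs before imputation.
(a_hat, b_hat), _ = curve_fit(dropout_prob, u, np.clip(p_obs, 0, 1), p0=(1.0, 1.0))
print(f"fitted decay rate a={a_hat:.2f}, midpoint b={b_hat:.2f}")
```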
Model-based methods
The large numbers of cells profiled by scRNA-seq give researchers a unique opportunity to apply deep learning methods to model noisy and complex scRNA-seq data. (This happens to be my own career ambition.)
(1) scScope and DCA (Deep Count Autoencoder) apply regular autoencoders to denoise single-cell gene expression data and impute the missing values (again, I confess I had not heard of these either). In autoencoders, the low-dimensional bottleneck layer forces the encoder to learn only the essential latent representations, and the decoding procedure ignores non-essential sources of variation in the expression data (fairly technical; look it up if you are interested).
(2) Compared to scScope, DCA explicitly models the overdispersion and zero inflation with a zero-inflated negative binomial (ZINB) model-based loss function and learns gene-specific parameters (mean, dispersion and dropout probability) from the scRNA-seq data. (How familiar are you with the zero-inflated negative binomial? scanpy users should recognize it; a likelihood sketch follows after this list.)
(3) SCVI and SCVIS are variational autoencoders (VAE) focusing on dimension reduction of scRNA-seq data (never heard of these either; clearly I still have a lot to learn). Unlike an autoencoder, a variational autoencoder assumes that the latent representations learnt by the encoder follow a predefined distribution (typically a Gaussian). SCVIS uses Student's t-distributions to replace the regular MSE-loss (mean squared error) VAE, while SCVI applies the ZINB-loss VAE to characterize scRNA-seq data (each distribution has its merits). A variational autoencoder is a deep generative model, but the assumption that latent representations follow a Gaussian may introduce an over-regularization problem and compromise performance (a clear drawback; no wonder they see limited use).
(4) More recently, Tian et al. developed a ZINB model-based deep clustering method (scDeepCluster) and showed that it could effectively characterize and cluster the discrete, over-dispersed and zero-inflated scRNA-seq count data (citing your own earlier paper, nicely done; and the ZINB really is the most widely used distribution for single-cell counts). scDeepCluster combines the ZINB model-based autoencoder with deep embedding clustering, which optimizes latent feature learning and clustering simultaneously to achieve better clustering results (the authors think it is good; whether it actually is, is for us to judge).
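Since the ZINB loss keeps coming up (DCA, SCVI, scDeepCluster, and scDCC itself), here is a minimal PyTorch sketch of the ZINB negative log-likelihood; the function name and numerical-stability details are mine, and the real implementations add size factors and log-space parametrizations.

```python
import torch

def zinb_nll(x, mu, theta, pi, eps=1e-8):
    """Zero-inflated negative binomial negative log-likelihood.
    x: counts; mu: mean; theta: dispersion; pi: dropout probability."""
    log_theta_mu = torch.log(theta + mu + eps)
    # log NB(x | mu, theta)
    log_nb = (torch.lgamma(x + theta) - torch.lgamma(theta) - torch.lgamma(x + 1)
              + theta * (torch.log(theta + eps) - log_theta_mu)
              + x * (torch.log(mu + eps) - log_theta_mu))
    # A zero can come from dropout (pi) or from the NB itself.
    nb_zero = theta * (torch.log(theta + eps) - log_theta_mu)
    log_zero = torch.log(pi + (1 - pi) * torch.exp(nb_zero) + eps)
    log_nonzero = torch.log(1 - pi + eps) + log_nb
    return -torch.where(x < 0.5, log_zero, log_nonzero).mean()

x = torch.tensor([0., 0., 3., 7.])
print(zinb_nll(x, mu=torch.full_like(x, 2.0),
               theta=torch.full_like(x, 1.5),
               pi=torch.full_like(x, 0.3)))
```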
Downstream considerations
Much of the downstream biological investigation relies on the initial clustering results. Although clustering aims to explore and uncover new information, biologists expect to see meaningful clusters that are consistent with their prior knowledge (a classic results-driven stance, not far from cherry-picking). In other words, totally exotic clustering with poor biological interpretability is puzzling and generally not desired by biologists (though it must still rest on objective facts). For a clustering algorithm, it is good to accommodate biological interpretability while minimizing the clustering loss on the computational side (that is the stated goal of the algorithm). Existing algorithms, however, support only unsupervised clustering (and supervised is not necessarily better than unsupervised), which sometimes contradicts prior knowledge. If a method initially fails to find a meaningful solution, the only recourse may be for the user to manually and repeatedly tweak clustering parameters until sufficiently good clusters are found. (Be careful here; do not just follow the crowd.)
We note that prior knowledge has become widely available in many cases (though it is not necessarily all correct). Quite a few cell type-specific signature sets have been published (but every sample is different; one size cannot fit all). Ignoring prior information may lead to suboptimal, unexpected, and even illogical clustering results (I do not fully agree with this sentence: improving the algorithm is understandable, but too much human input also harms the results). The paper then lists several cell-annotation tools. Frankly, cell identity is a dynamic process, and expecting software alone to recognize it is unrealistic; prior knowledge does not fit every situation, since different tissues, origins, strains and treatments all change cell states. So I personally disagree with the view taken here.
However, there are several limitations of these methods.
(1) First, they are developed in the context of marker genes and lack the flexibility to integrate other kinds of prior information. (Too much human input is dangerous.)
(2) Second, they are only applicable to scenarios where cell types are predefined and well-studied marker genes exist. (This is also questionable.)
Poorly understood cell types would be invisible to these methods. Finally, they both ignore pervasive dropout events, a well-known problem for scRNA-seq data.
In this article, we are interested in integrating prior information into the modeling process to guide our deep learning model to simultaneously learn meaningful and desired latent representations and clusters (prior knowledge combined with machine learning; once human factors enter the loop, proceed with caution). The authors convert (partial) prior knowledge into soft pairwise constraints and add them as additional terms to the loss function for optimization (deliberately injecting external information). This falls under semi-supervised learning, and the software is scDCC.
scDCC encodes prior knowledge into constraint information, which is integrated into the clustering procedure via a novel loss function. The rest, of course, is the authors praising their own method; we should read it critically. (The algorithmic details are covered in the Methods section.)
Result 1: Pairwise constraints.
Pairwise constraints mainly focus on the together-or-apart guidance defined by prior information and domain knowledge. They enforce small divergence between predefined "similar" samples while enlarging the difference between "dissimilar" instances (in plain terms, they constrain the "distance" between similar and dissimilar samples). Researchers usually encode the together and apart information into must-link (ML) and cannot-link (CL) constraints, respectively (information coding). With a proper setup, pairwise constraints have been proved capable of defining any ground-truth partition (this is essentially classic machine learning). In the context of scRNA-seq studies, pairwise constraints can be constructed from cell distances computed on marker genes (where do the marker genes come from? other people's work? that seems shaky), from cell sorting using flow cytometry, or from other methods depending on the real application scenario. A sketch of such a constraint loss follows.
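Here is a minimal sketch of a soft must-link / cannot-link penalty in the spirit of scDCC's constraint loss (my own simplification, not the authors' exact code); it operates on the soft cluster assignments q produced by the clustering layer.

```python
import torch

def pairwise_constraint_loss(q, ml_pairs, cl_pairs, eps=1e-8):
    """q: (n_cells, n_clusters) soft assignments; *_pairs: (m, 2) index tensors."""
    # Probability that the two cells of a pair fall into the same cluster.
    p_same_ml = (q[ml_pairs[:, 0]] * q[ml_pairs[:, 1]]).sum(dim=1)
    p_same_cl = (q[cl_pairs[:, 0]] * q[cl_pairs[:, 1]]).sum(dim=1)
    ml_loss = -torch.log(p_same_ml + eps).mean()      # pull ML pairs together
    cl_loss = -torch.log(1 - p_same_cl + eps).mean()  # push CL pairs apart
    return ml_loss + cl_loss
```

A term like this is then added, with a weight, alongside the reconstruction and clustering losses, which is what makes the constraints "soft" rather than hard assignments.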
To evaluate the performance of pairwise constraints, four scRNA-seq datasets were used (see the paper for details).
We selected 10% of cells with known labels to generate constraints in each dataset and evaluated the performance of scDCC on the remaining 90% of cells (frankly, this design is underwhelming). We show that the prior information encoded as soft constraints could help inform the latent representations of the remaining cells and therefore improve clustering performance (I find this part unconvincing).
Three clustering metrics:
(1) normalized mutual information (NMI), range 0 to 1.
(2) clustering accuracy (CA), range 0 to 1.
(3) adjusted Rand index (ARI) (see the Rand index primer below), which, unlike the other two, can be negative.
(A quick primer: the Rand index requires the true class labels C. Let K be the clustering result, a the number of element pairs placed in the same class in both C and K, and b the number of pairs placed in different classes in both C and K; for n samples the Rand index is then RI = (a + b) / C(n, 2). It measures whether the same objects are grouped together under the two partitions, and the ARI adjusts it for chance.)
A larger value indicates better concordance between the predicted labels and the ground truth. The number of pairwise constraints fed into the model explicitly controls how much prior information is applied in the clustering process (a fairly significant limitation). The three metrics can be computed as sketched below.
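NMI and ARI come straight from scikit-learn; CA needs the best one-to-one matching between predicted and true labels, usually solved with the Hungarian algorithm. A self-contained sketch (toy labels and the helper name are mine):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score, adjusted_rand_score

def clustering_accuracy(y_true, y_pred):
    """CA: accuracy under the best one-to-one label matching (Hungarian)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    k = max(y_true.max(), y_pred.max()) + 1
    cost = np.zeros((k, k), dtype=int)
    for t, p in zip(y_true, y_pred):
        cost[p, t] += 1
    row, col = linear_sum_assignment(-cost)  # maximize matched pairs
    return cost[row, col].sum() / len(y_true)

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [1, 1, 0, 0, 2, 2]
print("NMI:", normalized_mutual_info_score(y_true, y_pred))
print("ARI:", adjusted_rand_score(y_true, y_pred))
print("CA: ", clustering_accuracy(y_true, y_pred))
```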
Now let's look at the paper's experimental results.
Good, naturally; the priors in the paper were certainly well prepared. For datasets that are difficult to cluster, imposing a small set of pairwise constraints significantly improves the results. With 6000 pairwise constraints, scDCC achieves acceptable performance on all four datasets (with priors this strong, is further validation even needed? one could almost just annotate everything directly).
A random subset of the corresponding ML (blue lines) and CL (red lines) constraints is also plotted on the t-SNE embedding.
As shown, the latent representations learned by the ZINB model-based autoencoder are noisy and different labels are mixed. Although the representations from scDeepCluster can separate different clusters, inconsistencies against the constraints remain. Finally, by incorporating the soft constraints into model training, scDCC was able to precisely separate the clusters, and the results are consistent with both ML (blue lines) and CL (red lines) constraints. (Of course their own software performs best; that part reads like filler, since they need the paper published.) Overall, these results show that pairwise constraints can help to learn a better representation during the end-to-end learning procedure and improve clustering performance.
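Recreating that kind of figure is straightforward; a toy sketch (synthetic embedding and constraints, all names mine) that draws ML/CL pairs as colored lines over a t-SNE plot:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Toy latent representations: 300 cells in 3 well-separated clusters.
emb = np.concatenate([rng.normal(c, 1.0, size=(100, 16)) for c in (0, 4, 8)])
labels = np.repeat([0, 1, 2], 100)
ml = rng.choice(100, size=(20, 2))                # pairs inside cluster 0
cl = np.column_stack([rng.choice(100, 20),
                      rng.choice(100, 20) + 100])  # pairs across clusters

xy = TSNE(n_components=2, random_state=0).fit_transform(emb)
plt.scatter(xy[:, 0], xy[:, 1], s=5, c=labels, cmap="tab10")
for i, j in ml:
    plt.plot(xy[[i, j], 0], xy[[i, j], 1], c="blue", lw=0.5)  # must-link
for i, j in cl:
    plt.plot(xy[[i, j], 0], xy[[i, j], 1], c="red", lw=0.5)   # cannot-link
plt.show()
```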
For the randomly selected 2100 cells in each dataset, scDCC with zero constraints outperformed most competing scRNA-seq clustering methods (this is the more meaningful comparison), although some strong methods beat scDCC with zero constraints on some datasets, e.g., SC3 and Seurat on mouse bladder cells (Seurat's clustering really is quite good). Once constraints are added, scDCC performs best, which feels a bit like stacking the deck.
The content below is the important part.
In real applications, we recognize that constraint information may not be 100% accurate (even 50% reliability would already be decent). To evaluate the robustness of the proposed method, the authors applied scDCC to the datasets with 5% and 10% erroneous pairwise constraints (let's see what happens with some prior error rate). Robustness held up, of course, or this paper would never have appeared; but once the error rate gets high, the method breaks down completely. Therefore, users should take caution when adding highly erroneous constraints. The remaining validation results also look good.
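One simple way to simulate such erroneous constraints, my assumption rather than the paper's documented protocol, is to flip a fraction of ML pairs into CL and vice versa:

```python
import numpy as np

def corrupt_constraints(ml, cl, error_rate, seed=0):
    """Swap a random fraction of ML pairs into CL and vice versa,
    mimicking the 5% / 10% erroneous-constraint experiments."""
    rng = np.random.default_rng(seed)
    flip_ml = rng.random(len(ml)) < error_rate
    flip_cl = rng.random(len(cl)) < error_rate
    new_ml = np.vstack([ml[~flip_ml], cl[flip_cl]])
    new_cl = np.vstack([cl[~flip_cl], ml[flip_ml]])
    return new_ml, new_cl
```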
Result 2: Robustness on highly dispersed genes.
Gene filtering is widely applied in many single-cell analysis pipelines (usually the first real analysis step). One typical gene filtering strategy is to filter out low-variability genes and keep only highly dispersed genes (highly variable gene selection). Selecting highly dispersed genes may amplify the differences among cells but lose key information between cell clusters (is that actually true??). To evaluate the robustness of scDCC on highly dispersed genes, experiments were conducted on the top 2000 highly dispersed genes of the four datasets, comparing scDCC against the baseline methods. The results are fine, but of limited practical use. A typical selection step is sketched below.
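For reference, the usual dispersion-based filtering looks like this in scanpy (the input file name is a placeholder); it mirrors the "top 2000 highly dispersed genes" setup in the paper:

```python
import scanpy as sc

adata = sc.read_h5ad("pbmc_counts.h5ad")  # hypothetical file

# Seurat-flavor dispersion-based selection of the top 2000 variable genes.
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, n_top_genes=2000, flavor="seurat")
adata = adata[:, adata.var["highly_variable"]].copy()
```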
Result 3: Real applications and use cases. (Worth a look.)
Generating accurate constraints is the key to successfully applying the proposed scDCC algorithm and obtaining robust, desired clustering results (so this is the main limiting condition). There are two routes; a small constraint-building sketch follows the list.
(1) Protein marker-based constraints.
(2) Marker gene-based constraints.
Both require manual labeling up front, so there is still a long road ahead.
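However the labels are obtained, turning a partial labeling into ML/CL pairs is mechanical; a toy sketch (my own helper, not from the scDCC repo), where -1 marks unlabeled cells:

```python
import numpy as np

def constraints_from_labels(labels, n_pairs, seed=0):
    """Sample must-link / cannot-link pairs from partially labeled cells."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    idx = np.where(labels >= 0)[0]          # labeled cells only
    ml, cl = [], []
    while len(ml) + len(cl) < n_pairs:
        i, j = rng.choice(idx, size=2, replace=False)
        (ml if labels[i] == labels[j] else cl).append((i, j))
    return np.array(ml), np.array(cl)

labels = [0, 0, 1, -1, 1, -1, 0]            # e.g. from marker-gene scoring
ml, cl = constraints_from_labels(labels, n_pairs=6)
```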
Methods
As for the code, it is here: scDCC.
Reading this paper, I could feel my life ebbing away.
Life is good, and better with you.