Hi everyone. Today let's put in some effort and look at the dropout phenomenon that pervades 10X single-cell and 10X spatial transcriptomics data, how it affects our analyses, and how the method in this paper, scDCC, works around it. The paper is "Model-based deep embedding for constrained clustering analysis of single cell RNA-seq data", published in Nature Communications in March 2021 by a Chinese group; not a bad venue at all. For background on dropout, see my earlier post 深度學(xué)習(xí)中Dropout原理解析(10X單細(xì)胞和10X空間轉(zhuǎn)錄組) for a quick primer. Now let's read the paper in depth and see how it tackles the problem. (When will we finally write our own algorithms, instead of just reading and borrowing other people's?)
As usual: first the paper, then example code.
Abstract
Clustering is a critical step in single cell-based studies. Most existing methods support unsupervised clustering without the a priori exploitation of any domain knowledge. (I suspect that's how most of us work: once we have the matrix, we go straight into Seurat for dimension reduction and clustering, with hardly any prior knowledge involved.) When confronted by the high dimensionality and pervasive dropout events of scRNA-seq data, purely unsupervised clustering methods may not produce biologically interpretable clusters (ever seen this? some clusters have few or even no differential genes and simply cannot be annotated), which complicates cell type assignment. In such cases, the only recourse is for the user to manually and repeatedly tweak clustering parameters until acceptable clusters are found. (Yes, blind parameter tuning; perhaps this is also the biggest gulf between research services and clinical testing.) Consequently, the path to obtaining biologically meaningful clusters can be ad hoc and laborious. Here we report a principled clustering method named scDCC that integrates domain knowledge into the clustering step (using prior knowledge in clustering, which is in fact hard to do well). Experiments on various scRNA-seq datasets from thousands to tens of thousands of cells show that scDCC can significantly improve clustering performance, facilitating the interpretability of clusters and downstream analyses, such as cell type assignment. (The last sentence is boilerplate; every paper praises itself, or it wouldn't get published.)
Introduction
Let me distill this part.
The commonly used pipeline pairs dimension reduction (PCA, t-SNE, UMAP) with k-means or hierarchical clustering for visualization, including SC3 (spectral clustering), pcaReduce (PCA + k-means + hierarchical), TSCAN (PCA + Gaussian mixture model) and mpath (hierarchical), to name a few (there really are many; I have covered the principles of PCA in depth before, so feel free to look that up). Because scRNA-seq data are sparse (meaning dropout plus high gene-level variability), these traditional clustering methods tend to yield suboptimal results.
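To ground the discussion, here is a minimal sketch of that conventional pipeline (normalize, PCA, k-means) using scanpy and scikit-learn; the input file name and the choice of k are hypothetical placeholders, not from the paper.

```python
import scanpy as sc
from sklearn.cluster import KMeans

# Hypothetical input; any AnnData of raw counts works the same way.
adata = sc.read_h5ad("pbmc_counts.h5ad")

# Standard preprocessing: library-size normalization, log transform, PCA.
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.pca(adata, n_comps=50)

# Cluster in PCA space with k-means; k must be chosen by the user,
# which is exactly the kind of manual tweaking the paper complains about.
kmeans = KMeans(n_clusters=8, n_init=10, random_state=0)
adata.obs["kmeans"] = kmeans.fit_predict(adata.obsm["X_pca"]).astype(str)
```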
Recently, various clustering methods have been proposed to overcome the challenges in scRNA-seq data analysis.
(1) Shared nearest neighbor (SNN)-Cliq combines a quasi-clique-based clustering algorithm with the SNN-based similarity measure to automatically identify clusters in high-dimensional and highly variable scRNA-seq data (SNN is the approach Seurat's clustering uses).
(2) DendroSplit applies "split" and "merge" operations to the dendrogram obtained from hierarchical clustering, which iteratively groups cells by their pairwise distances (computed on selected genes), to reveal multiple levels of biologically meaningful populations with interpretable hyperparameters (hierarchical clustering is actually rarely used in practice).
(3) If the dropout probability P(u) is a decreasing function of the gene expression u, CIDR uses a nonlinear least-squares regression to empirically estimate P(u) and imputes the gene expressions with a weighted average to alleviate the impact of dropouts (personally I am not convinced this is reliable; see the curve-fitting sketch after this list).
(4) Clustering analysis is performed on the first few principal coordinates, obtained through principal coordinate analysis (PCoA) on the imputed expression matrix (this is what everyone does anyway; the only real difference is how many components are kept).
(5) SIMLR and MPSSC are both multiple kernel-based spectral clustering methods. Considering the complexities of scRNA-seq data, multiple kernel functions can help to learn robust similarity measures that correspond to different informative representations of the data (frankly, I had never heard of these methods). However, spectral clustering relies on the full graph Laplacian matrix, which is prohibitively expensive to compute and store (a conspicuous drawback; no wonder they are not widely used).
(6) The high complexity and limited scalability generally impede applying these methods to large scRNA-seq datasets (the sheer scale of single-cell data really does rule out the older generation of methods).
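To make item (3) concrete, here is a toy sketch, not CIDR's actual code, of empirically fitting a decreasing dropout-probability curve P(u) by nonlinear least squares; the logistic form and all names are my own assumptions.

```python
import numpy as np
from scipy.optimize import curve_fit

def dropout_prob(u, a, b):
    """Assumed form: a logistic curve decreasing in expression u."""
    return 1.0 / (1.0 + np.exp(a * (u - b)))

# Toy data: mean log-expression per gene vs. observed zero fraction.
rng = np.random.default_rng(0)
u = rng.uniform(0, 8, 500)
p_obs = dropout_prob(u, 1.2, 2.0) + rng.normal(0, 0.03, 500)

# Empirically estimate P(u), the step CIDR performs before imputation.
(a_hat, b_hat), _ = curve_fit(dropout_prob, u, np.clip(p_obs, 0, 1), p0=(1.0, 1.0))
print(f"fitted decay rate a={a_hat:.2f}, midpoint b={b_hat:.2f}")
```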
Model-based methods
The large numbers of cells profiled by scRNA-seq give researchers a unique opportunity to apply deep learning methods to model noisy and complex scRNA-seq data. (This happens to be my own career ambition.)
(1) scScope and DCA (Deep Count Autoencoder) apply regular autoencoders to denoise single-cell gene expression data and impute the missing values (again, I confess I had not heard of these either). In autoencoders, the low-dimensional bottleneck layer forces the encoder to learn only the essential latent representations, and the decoding procedure ignores non-essential sources of variation in the expression data (fairly technical; look it up if you are interested).
(2) Compared to scScope, DCA explicitly models the overdispersion and zero inflation with a zero-inflated negative binomial (ZINB) model-based loss function and learns gene-specific parameters (mean, dispersion and dropout probability) from the scRNA-seq data. (How familiar are you with the zero-inflated negative binomial? scanpy users should recognize it; a likelihood sketch follows after this list.)
(3) SCVI and SCVIS are variational autoencoders (VAE) focusing on dimension reduction of scRNA-seq data (never heard of these either; clearly I still have a lot to learn). Unlike an autoencoder, a variational autoencoder assumes that the latent representations learnt by the encoder follow a predefined distribution (typically a Gaussian). SCVIS uses Student's t-distributions to replace the regular MSE-loss (mean squared error) VAE, while SCVI applies the ZINB-loss VAE to characterize scRNA-seq data (each distribution has its merits). A variational autoencoder is a deep generative model, but the assumption that latent representations follow a Gaussian may introduce an over-regularization problem and compromise performance (a clear drawback; no wonder they see limited use).
(4) More recently, Tian et al. developed a ZINB model-based deep clustering method (scDeepCluster) and showed that it could effectively characterize and cluster the discrete, over-dispersed and zero-inflated scRNA-seq count data (citing your own earlier paper, nicely done; and the ZINB really is the most widely used distribution for single-cell counts). scDeepCluster combines the ZINB model-based autoencoder with deep embedding clustering, which optimizes latent feature learning and clustering simultaneously to achieve better clustering results (the authors think it is good; whether it actually is, is for us to judge).
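Since the ZINB loss keeps coming up (DCA, SCVI, scDeepCluster, and scDCC itself), here is a minimal PyTorch sketch of the ZINB negative log-likelihood; the function name and numerical-stability details are mine, and the real implementations add size factors and log-space parametrizations.

```python
import torch

def zinb_nll(x, mu, theta, pi, eps=1e-8):
    """Zero-inflated negative binomial negative log-likelihood.
    x: counts; mu: mean; theta: dispersion; pi: dropout probability."""
    log_theta_mu = torch.log(theta + mu + eps)
    # log NB(x | mu, theta)
    log_nb = (torch.lgamma(x + theta) - torch.lgamma(theta) - torch.lgamma(x + 1)
              + theta * (torch.log(theta + eps) - log_theta_mu)
              + x * (torch.log(mu + eps) - log_theta_mu))
    # A zero can come from dropout (pi) or from the NB itself.
    nb_zero = theta * (torch.log(theta + eps) - log_theta_mu)
    log_zero = torch.log(pi + (1 - pi) * torch.exp(nb_zero) + eps)
    log_nonzero = torch.log(1 - pi + eps) + log_nb
    return -torch.where(x < 0.5, log_zero, log_nonzero).mean()

x = torch.tensor([0., 0., 3., 7.])
print(zinb_nll(x, mu=torch.full_like(x, 2.0),
               theta=torch.full_like(x, 1.5),
               pi=torch.full_like(x, 0.3)))
```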
Downstream considerations
Much of the downstream biological investigation relies on the initial clustering results. Although clustering aims to explore and uncover new information, biologists expect to see meaningful clusters that are consistent with their prior knowledge (a classic results-driven stance, not far from cherry-picking). In other words, totally exotic clustering with poor biological interpretability is puzzling and generally not desired by biologists (though it must still rest on objective facts). For a clustering algorithm, it is good to accommodate biological interpretability while minimizing the clustering loss on the computational side (that is the stated goal of the algorithm). Existing algorithms, however, support only unsupervised clustering (and supervised is not necessarily better than unsupervised), which sometimes contradicts prior knowledge. If a method initially fails to find a meaningful solution, the only recourse may be for the user to manually and repeatedly tweak clustering parameters until sufficiently good clusters are found. (Be careful here; do not just follow the crowd.)
We note that prior knowledge has become widely available in many cases (though it is not necessarily all correct). Quite a few cell type-specific signature sets have been published (but every sample is different; one size cannot fit all). Ignoring prior information may lead to suboptimal, unexpected, and even illogical clustering results (I do not fully agree with this sentence: improving the algorithm is understandable, but too much human input also harms the results). The paper then lists several cell-annotation tools. Frankly, cell identity is a dynamic process, and expecting software alone to recognize it is unrealistic; prior knowledge does not fit every situation, since different tissues, origins, strains and treatments all change cell states. So I personally disagree with the view taken here.
However, there are several limitations of these methods.
(1) First, they are developed in the context of marker genes and lack the flexibility to integrate other kinds of prior information. (Too much human input is dangerous.)
(2) Second, they are only applicable to scenarios where cell types are predefined and well-studied marker genes exist. (This is also questionable.)
Poorly understood cell types would be invisible to these methods. Finally, they both ignore pervasive dropout events, a well-known problem for scRNA-seq data.
In this article, we are interested in integrating prior information into the modeling process to guide our deep learning model to simultaneously learn meaningful and desired latent representations and clusters (prior knowledge combined with machine learning; once human factors enter the loop, proceed with caution). The authors convert (partial) prior knowledge into soft pairwise constraints and add them as additional terms to the loss function for optimization (deliberately injecting external information). This falls under semi-supervised learning, and the software is scDCC.
scDCC encodes prior knowledge into constraint information, which is integrated into the clustering procedure via a novel loss function. The rest, of course, is the authors praising their own method; we should read it critically. (The algorithmic details are covered in the Methods section.)
Result 1: Pairwise constraints.
Pairwise constraints mainly focus on the together-or-apart guidance defined by prior information and domain knowledge. They enforce small divergence between predefined "similar" samples while enlarging the difference between "dissimilar" instances (in plain terms, they constrain the "distance" between similar and dissimilar samples). Researchers usually encode the together and apart information into must-link (ML) and cannot-link (CL) constraints, respectively (information coding). With a proper setup, pairwise constraints have been proved capable of defining any ground-truth partition (this is essentially classic machine learning). In the context of scRNA-seq studies, pairwise constraints can be constructed from cell distances computed on marker genes (where do the marker genes come from? other people's work? that seems shaky), from cell sorting using flow cytometry, or from other methods depending on the real application scenario. A sketch of such a constraint loss follows.
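Here is a minimal sketch of a soft must-link / cannot-link penalty in the spirit of scDCC's constraint loss (my own simplification, not the authors' exact code); it operates on the soft cluster assignments q produced by the clustering layer.

```python
import torch

def pairwise_constraint_loss(q, ml_pairs, cl_pairs, eps=1e-8):
    """q: (n_cells, n_clusters) soft assignments; *_pairs: (m, 2) index tensors."""
    # Probability that the two cells of a pair fall into the same cluster.
    p_same_ml = (q[ml_pairs[:, 0]] * q[ml_pairs[:, 1]]).sum(dim=1)
    p_same_cl = (q[cl_pairs[:, 0]] * q[cl_pairs[:, 1]]).sum(dim=1)
    ml_loss = -torch.log(p_same_ml + eps).mean()      # pull ML pairs together
    cl_loss = -torch.log(1 - p_same_cl + eps).mean()  # push CL pairs apart
    return ml_loss + cl_loss
```

A term like this is then added, with a weight, alongside the reconstruction and clustering losses, which is what makes the constraints "soft" rather than hard assignments.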
To evaluate the performance of pairwise constraints, four scRNA-seq datasets were used (see the paper for details).
We selected 10% of cells with known labels to generate constraints in each dataset and evaluated the performance of scDCC on the remaining 90% of cells (frankly, this design is underwhelming). We show that the prior information encoded as soft constraints could help inform the latent representations of the remaining cells and therefore improve clustering performance (I find this part unconvincing).
Three clustering metrics:
(1) normalized mutual information (NMI), range 0 to 1.
(2) clustering accuracy (CA), range 0 to 1.
(3) adjusted Rand index (ARI) (see the Rand index primer below), which, unlike the other two, can be negative.
(A quick primer: the Rand index requires the true class labels C. Let K be the clustering result, a the number of element pairs placed in the same class in both C and K, and b the number of pairs placed in different classes in both C and K; for n samples the Rand index is then RI = (a + b) / C(n, 2). It measures whether the same objects are grouped together under the two partitions, and the ARI adjusts it for chance.)
A larger value indicates better concordance between the predicted labels and the ground truth. The number of pairwise constraints fed into the model explicitly controls how much prior information is applied in the clustering process (a fairly significant limitation). The three metrics can be computed as sketched below.
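NMI and ARI come straight from scikit-learn; CA needs the best one-to-one matching between predicted and true labels, usually solved with the Hungarian algorithm. A self-contained sketch (toy labels and the helper name are mine):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score, adjusted_rand_score

def clustering_accuracy(y_true, y_pred):
    """CA: accuracy under the best one-to-one label matching (Hungarian)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    k = max(y_true.max(), y_pred.max()) + 1
    cost = np.zeros((k, k), dtype=int)
    for t, p in zip(y_true, y_pred):
        cost[p, t] += 1
    row, col = linear_sum_assignment(-cost)  # maximize matched pairs
    return cost[row, col].sum() / len(y_true)

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [1, 1, 0, 0, 2, 2]
print("NMI:", normalized_mutual_info_score(y_true, y_pred))
print("ARI:", adjusted_rand_score(y_true, y_pred))
print("CA: ", clustering_accuracy(y_true, y_pred))
```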
Now let's look at the paper's experimental results.
Good, naturally; the priors in the paper were certainly well prepared. For datasets that are difficult to cluster, imposing a small set of pairwise constraints significantly improves the results. With 6000 pairwise constraints, scDCC achieves acceptable performance on all four datasets (with priors this strong, is further validation even needed? one could almost just annotate everything directly).
A random subset of the corresponding ML (blue lines) and CL (red lines) constraints is also plotted on the t-SNE embedding.
As shown, the latent representations learned by the ZINB model-based autoencoder are noisy and different labels are mixed. Although the representations from scDeepCluster can separate different clusters, inconsistencies against the constraints remain. Finally, by incorporating the soft constraints into model training, scDCC was able to precisely separate the clusters, and the results are consistent with both ML (blue lines) and CL (red lines) constraints. (Of course their own software performs best; that part reads like filler, since they need the paper published.) Overall, these results show that pairwise constraints can help to learn a better representation during the end-to-end learning procedure and improve clustering performance.
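Recreating that kind of figure is straightforward; a toy sketch (synthetic embedding and constraints, all names mine) that draws ML/CL pairs as colored lines over a t-SNE plot:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Toy latent representations: 300 cells in 3 well-separated clusters.
emb = np.concatenate([rng.normal(c, 1.0, size=(100, 16)) for c in (0, 4, 8)])
labels = np.repeat([0, 1, 2], 100)
ml = rng.choice(100, size=(20, 2))                # pairs inside cluster 0
cl = np.column_stack([rng.choice(100, 20),
                      rng.choice(100, 20) + 100])  # pairs across clusters

xy = TSNE(n_components=2, random_state=0).fit_transform(emb)
plt.scatter(xy[:, 0], xy[:, 1], s=5, c=labels, cmap="tab10")
for i, j in ml:
    plt.plot(xy[[i, j], 0], xy[[i, j], 1], c="blue", lw=0.5)  # must-link
for i, j in cl:
    plt.plot(xy[[i, j], 0], xy[[i, j], 1], c="red", lw=0.5)   # cannot-link
plt.show()
```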
For the randomly selected 2100 cells in each dataset, scDCC with zero constraints outperformed most competing scRNA-seq clustering methods (this is the more meaningful comparison), although some strong methods beat scDCC with zero constraints on some datasets, e.g., SC3 and Seurat on mouse bladder cells (Seurat's clustering really is quite good). Once constraints are added, scDCC performs best, which feels a bit like stacking the deck.
The content below is the important part.
In real applications, we recognize that constraint information may not be 100% accurate (even 50% reliability would already be decent). To evaluate the robustness of the proposed method, the authors applied scDCC to the datasets with 5% and 10% erroneous pairwise constraints (let's see what happens with some prior error rate). Robustness held up, of course, or this paper would never have appeared; but once the error rate gets high, the method breaks down completely. Therefore, users should take caution when adding highly erroneous constraints. The remaining validation results also look good.
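One simple way to simulate such erroneous constraints, my assumption rather than the paper's documented protocol, is to flip a fraction of ML pairs into CL and vice versa:

```python
import numpy as np

def corrupt_constraints(ml, cl, error_rate, seed=0):
    """Swap a random fraction of ML pairs into CL and vice versa,
    mimicking the 5% / 10% erroneous-constraint experiments."""
    rng = np.random.default_rng(seed)
    flip_ml = rng.random(len(ml)) < error_rate
    flip_cl = rng.random(len(cl)) < error_rate
    new_ml = np.vstack([ml[~flip_ml], cl[flip_cl]])
    new_cl = np.vstack([cl[~flip_cl], ml[flip_ml]])
    return new_ml, new_cl
```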
Result 2: Robustness on highly dispersed genes.
Gene filtering is widely applied in many single-cell analysis pipelines (usually the first real analysis step). One typical gene filtering strategy is to filter out low-variability genes and keep only highly dispersed genes (highly variable gene selection). Selecting highly dispersed genes may amplify the differences among cells but lose key information between cell clusters (is that actually true??). To evaluate the robustness of scDCC on highly dispersed genes, experiments were conducted on the top 2000 highly dispersed genes of the four datasets, comparing scDCC against the baseline methods. The results are fine, but of limited practical use. A typical selection step is sketched below.
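For reference, the usual dispersion-based filtering looks like this in scanpy (the input file name is a placeholder); it mirrors the "top 2000 highly dispersed genes" setup in the paper:

```python
import scanpy as sc

adata = sc.read_h5ad("pbmc_counts.h5ad")  # hypothetical file

# Seurat-flavor dispersion-based selection of the top 2000 variable genes.
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, n_top_genes=2000, flavor="seurat")
adata = adata[:, adata.var["highly_variable"]].copy()
```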
Result 3: Real applications and use cases. (Worth a look.)
Generating accurate constraints is the key to successfully applying the proposed scDCC algorithm and obtaining robust, desired clustering results (so this is the main limiting condition). There are two routes; a small constraint-building sketch follows the list.
(1) Protein marker-based constraints.
(2) Marker gene-based constraints.
Both require manual labeling up front, so there is still a long road ahead.
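However the labels are obtained, turning a partial labeling into ML/CL pairs is mechanical; a toy sketch (my own helper, not from the scDCC repo), where -1 marks unlabeled cells:

```python
import numpy as np

def constraints_from_labels(labels, n_pairs, seed=0):
    """Sample must-link / cannot-link pairs from partially labeled cells."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    idx = np.where(labels >= 0)[0]          # labeled cells only
    ml, cl = [], []
    while len(ml) + len(cl) < n_pairs:
        i, j = rng.choice(idx, size=2, replace=False)
        (ml if labels[i] == labels[j] else cl).append((i, j))
    return np.array(ml), np.array(cl)

labels = [0, 0, 1, -1, 1, -1, 0]            # e.g. from marker-gene scoring
ml, cl = constraints_from_labels(labels, n_pairs=6)
```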
Methods
As for the code, it is here: scDCC.
Reading this paper, I could feel my life ebbing away.
Life is good, and better with you.