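There are many dimensionality-reduction methods for single-cell data (PCA, t-SNE, UMAP). You don't need to try them one by one: master a few major analysis tools, understand their principles and code in depth, and combine their complementary strengths to reach your analysis goals.
Today I'm sharing one such method, from the paper "Visualizing structure and transitions in high-dimensional biological data", published in a journal with an impact factor above 36, which is quite high. Our task today is to work through the paper and share the code. Please study it carefully and grasp the essentials rather than simply copying code.
The paper:
I. Abstract: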
The high-dimensional data created by high-throughput technologies require visualization tools that reveal data structure and patterns in an intuitive form. We present PHATE, a visualization method that captures both local and global nonlinear structure using an information-geometric distance between data points. We compare PHATE to other tools on a variety of artificial and biological datasets, and find that it consistently preserves a range of patterns in data, including continual progressions, branches and clusters, better than other tools. We define a manifold preservation metric, which we call denoised embedding manifold preservation (DEMaP), and show that PHATE produces lower-dimensional embeddings that are quantitatively better denoised as compared to existing visualization methods. An analysis of a newly generated single-cell RNA sequencing dataset on human germ-layer differentiation demonstrates how PHATE reveals unique biological insight into the main developmental branches, including identification of three previously undescribed subpopulations. We also show that PHATE is applicable to a wide variety of data types, including mass cytometry, single-cell RNA sequencing, Hi-C and gut microbiome data. (Nothing much to see in this part; the authors are simply praising their own software.)
II. Introduction
Single-cell data does need very good visualization software. Existing tools include principal component analysis (PCA), t-distributed stochastic neighbor embedding (t-SNE) and Uniform Manifold Approximation and Projection (UMAP); UMAP is probably the most widely used today. However, these methods are suboptimal for exploring high-dimensional biological data. The reasons:
(1) Such methods tend to be sensitive to noise. (I don't know whether you have looked into this: denoising of single-cell data and droplet analysis.) Methods like PCA and Isomap fail to explicitly remove this noise for visualization, rendering fine-grained local structure impossible to recognize. (Take note here: PCA really does have this problem.)
(2) Nonlinear visualization methods such as t-SNE often scramble the global structure in data. (The global structure is not accurate enough, which is why UMAP is used more often now.)
(3) Many dimensionality-reduction methods (for example, PCA and diffusion maps) fail to optimize for two-dimensional (2D) visualization as they are not specifically designed for visualization. (Those who have attended my course will find this familiar!)
(4) Common implementations of dimensionality-reduction methods often lack computational scalability (poor scalability). State-of-the-art methods such as multidimensional scaling (MDS) and t-SNE were originally presented as proofs-of-concept with somewhat naive implementations, which do not scale well to datasets with hundreds of thousands, let alone millions, of data points owing to speed or memory constraints. (I don't know whether you have looked into this. Once again: don't just copy code; think critically.)
(5) Some methods try to alleviate visualization challenges by directly imposing a fixed geometry or intrinsic structure on the data. However, methods that impose a structure on the data generally have no way of alerting the user whether the structural assumption is correct. (Many newer tools have already corrected this.) The authors give an example: any data will be transformed to fit a tree with Monocle2 or clusters with t-SNE. While such methods are useful for data that fit their prior assumptions, they can generate misleading results otherwise, and are often ill suited for hypothesis generation or data exploration. (This should be familiar: now you can see why clustering and Monocle2 results are so often unsatisfying.)
Next come the advantages of PHATE, which we will only skim. According to the paper, PHATE provides an accurate, denoised representation of both local and global structure of a dataset in the required number of dimensions without imposing any strong assumptions on the structure of the data, and is highly scalable both in memory and runtime.
III. Results
Let's first go over some basics:
(1) t-SNE focuses on preserving local structure, often at the expense of the global structure.
(2) PCA focuses on preserving global structure at the expense of the local structure.
(3) Although PCA is often used for denoising as a preprocessing step, both PCA and t-SNE provide noisy visualizations when the data is noisy, which can obscure the structure of the data. (Make sure you understand this; otherwise you will finish an analysis without knowing whether it is right or wrong.)
(4) By contrast, diffusion maps effectively denoise data and learn the local and global structure. However, diffusion maps typically encode this information in higher dimensions, which are not amenable to visualization, and can introduce distortions in the visualization under certain conditions. (Diffusion maps were covered in an earlier lesson.)
Here is the key point: PHATE is designed to overcome these weaknesses and provide a visualization that preserves the local and global structure of the data, denoises the data and presents as much information as possible into low dimensions.
The main steps are:
(1) Encode local data information via local similarities (local structure). The distance used here is still Euclidean distance. (I covered how distances are defined in R in class; you must know these basics.)
(2) Encode global relationships in data using the potential distance. This uses the diffusion map algorithm, which I also covered in class.
(3) Embed potential distance information into low dimensions for visualization (low-dimensional visualization). This ensures that all variability is squeezed into two dimensions for a maximally informative embedding.
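To make the three steps above concrete, here is a minimal NumPy sketch. It is a toy illustration under my own assumptions, not the phate package's implementation: the function names (`pairwise_distances`, `phate_sketch`), the fixed `t=10` and the demo data are all made up for illustration, and step 3 uses classical MDS as a simple stand-in for the embedding step.

```python
import numpy as np


def pairwise_distances(X):
    """Euclidean distance between every pair of rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def phate_sketch(X, knn=5, decay=15, t=10, n_components=2):
    """Toy version of the three steps (hypothetical helper, not the package API)."""
    D = pairwise_distances(X)

    # Step 1: local similarities from Euclidean distances.
    # eps[i] is the distance from point i to its knn-th nearest neighbour.
    eps = np.sort(D, axis=1)[:, knn]  # column 0 is the point itself
    K = np.exp(-(D / eps[:, None]) ** decay)
    K = (K + K.T) / 2  # symmetrise the kernel

    # Step 2: diffusion. Row-normalise to a Markov matrix, take t steps,
    # then compare log-transformed transition probabilities ("potentials").
    P = K / K.sum(axis=1, keepdims=True)
    P_t = np.linalg.matrix_power(P, t)
    U = -np.log(P_t + 1e-7)
    potential_dist = pairwise_distances(U)

    # Step 3: embed the potential distances (classical MDS as a stand-in).
    n = len(X)
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (potential_dist ** 2) @ J
    vals, vecs = np.linalg.eigh(B)
    top = np.argsort(vals)[::-1][:n_components]
    return vecs[:, top] * np.sqrt(np.maximum(vals[top], 0))


# Made-up demo data: 60 points along a slightly noisy curve in 3D.
rng = np.random.default_rng(0)
s = np.linspace(0, 1, 60)
X = np.column_stack([s, np.sin(s), np.zeros_like(s)])
X = X + rng.normal(scale=0.002, size=X.shape)

Y = phate_sketch(X)
print(Y.shape)  # (60, 2)
```

In this toy the two ends of the curve should land far apart in `Y` while neighbouring points stay close, which is the "local plus global" behaviour the paper describes.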
Analysis strategies recommended by the paper
Here we present new methods that provide suggested end points, branch points and branches on the basis of the information from higher-dimensional PHATE embeddings. (This is an analysis of the data's structure; as you can see, it resembles the tree structure of Monocle2.)
(1) Branch-point identification with local intrinsic dimensionality. See the paper's figure for how branch points are defined. Branch points often encapsulate switch-like decisions where cells sharply veer towards one of a small number of fates.
(2) End-point identification with diffusion extrema. (Surprisingly, this software also identifies end points, on a par with URD.) We identify end points in the PHATE embedding as those that are least central and most distinct by computing the eigenvector centrality and the distinctness of a cellular state relative to the general data by considering the minima and maxima of diffusion eigenvectors as motivated by ref. If you are interested, this is worth studying closely: identifying branch points and end points, and assigning cells to the trajectory, demands a lot of prior knowledge, which of course also means greater accuracy. Let's look at the result of the assignment.
It looks much like a force-directed layout.
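The two ideas above, local intrinsic dimensionality for branch points and diffusion extrema for end points, can be sketched in NumPy as follows. This is only an illustration under my own assumptions: the helper names and demo data are made up, the dimension estimate uses the Levina-Bickel maximum-likelihood estimator as a stand-in (the paper's estimator may differ), and the end-point part looks only at the first non-trivial diffusion eigenvector, leaving out the eigenvector-centrality criterion.

```python
import numpy as np


def pairwise_distances(X):
    """Euclidean distance between every pair of rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def local_intrinsic_dim(X, k=20):
    """Levina-Bickel maximum-likelihood dimension estimate around each point."""
    T = np.sort(pairwise_distances(X), axis=1)[:, 1:k + 1]  # drop self-distance
    log_ratio = np.log(T[:, -1:] / T[:, :-1])  # log(T_k / T_j), j = 1..k-1
    return (k - 1) / log_ratio.sum(axis=1)


def diffusion_extrema(X, sigma=0.1):
    """Indices of the min and max of the first non-trivial diffusion eigenvector."""
    D = pairwise_distances(X)
    K = np.exp(-(D / sigma) ** 2)  # Gaussian kernel (sigma is an assumed bandwidth)
    d = K.sum(axis=1)
    A = K / np.sqrt(np.outer(d, d))  # symmetric, same eigenvalues as the Markov matrix
    vals, vecs = np.linalg.eigh(A)  # eigenvalues in ascending order
    psi = vecs[:, -2] / np.sqrt(d)  # right eigenvector of the Markov matrix
    return int(np.argmin(psi)), int(np.argmax(psi))


rng = np.random.default_rng(0)

# Made-up data: a 1D line and a 2D sheet, both embedded in 3D.
u = rng.uniform(size=400)
line_pts = np.column_stack([u, 2 * u, -u])
a, b = rng.uniform(size=(2, 400))
plane_pts = np.column_stack([a, b, a + b])
print("mean local dimension, line :", local_intrinsic_dim(line_pts).mean())
print("mean local dimension, sheet:", local_intrinsic_dim(plane_pts).mean())

# Made-up data: 50 ordered points on a straight path; the extrema should be its two ends.
path_pts = np.column_stack([np.linspace(0, 1, 50), np.zeros(50)])
print("suggested end points:", diffusion_extrema(path_pts))
```

The dimension estimate comes out clearly higher for the sheet than for the line. That is the property the branch-point step relies on: cells near a branch are expected to look locally higher-dimensional than cells inside a single branch.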
Comparison between tools.
A quick look at this part is enough.
Looking at the results, PHATE is of course the most accurate. In my view this is to be expected in theory, because PHATE demands more manual supervision. PHATE had the highest DEMaP score in 22 of 24 comparisons and was the top-performing method overall. Uniform manifold approximation and projection (UMAP) was the second best performing method overall but had the highest DEMaP score in only two of the comparisons, one of which is equal with PHATE. (This is where UMAP does well.)
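For readers wondering what a DEMaP score measures, here is a rough NumPy sketch. As I understand the paper, DEMaP is the Spearman correlation between geodesic distances on the noiseless data and Euclidean distances in the embedding; treat that reading, the helper names, the k-nearest-neighbour graph used for geodesics and the demo data as assumptions. The paper scores embeddings computed from noisy data, whereas this toy passes in hand-made embeddings.

```python
import numpy as np


def pairwise_distances(X):
    """Euclidean distance between every pair of rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def geodesic_distances(X, k=5):
    """Shortest-path distances over a symmetric k-nearest-neighbour graph."""
    D = pairwise_distances(X)
    n = len(X)
    G = np.full((n, n), np.inf)
    nearest = np.argsort(D, axis=1)[:, :k + 1]  # includes the point itself
    rows = np.repeat(np.arange(n), k + 1)
    G[rows, nearest.ravel()] = D[rows, nearest.ravel()]
    G = np.minimum(G, G.T)
    for m in range(n):  # Floyd-Warshall relaxation through node m
        G = np.minimum(G, G[:, m:m + 1] + G[m:m + 1, :])
    return G


def spearman(x, y):
    """Spearman correlation as Pearson correlation of ranks (ties broken arbitrarily)."""
    rx = np.argsort(np.argsort(x))
    ry = np.argsort(np.argsort(y))
    return float(np.corrcoef(rx, ry)[0, 1])


def demap_sketch(clean_data, embedding, k=5):
    """Hypothetical helper: rank agreement between true geodesics and embedding distances."""
    iu = np.triu_indices(len(clean_data), k=1)  # each pair once
    geo = geodesic_distances(clean_data, k)[iu]
    emb = pairwise_distances(embedding)[iu]
    return spearman(geo, emb)


# Made-up ground truth: 40 points on a semicircle.
theta = np.linspace(0, np.pi, 40)
clean = np.column_stack([np.cos(theta), np.sin(theta)])

# A faithful 1D "unrolling" versus a scrambled embedding.
good_embedding = theta[:, None]
bad_embedding = np.random.default_rng(0).permutation(theta)[:, None]

good_score = demap_sketch(clean, good_embedding)
bad_score = demap_sketch(clean, bad_embedding)
print("faithful embedding :", good_score)
print("scrambled embedding:", bad_score)
```

The faithful unrolling scores close to 1 and the scrambled one scores far lower, which is the sense in which a higher DEMaP means the embedding respects distances along the manifold.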
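Comparing dimensionality-reduction visualizations across methods
PHATE provides a clean and relatively denoised visualization of the data that highlights both the local and global structure. The paper follows this with further data-analysis results; these are routine, so a skim is enough.
To sum up in one sentence: the problem PHATE addresses is making the low-dimensional visualization correspond to the intrinsic relationships among the cells themselves. On this, PHATE does best and UMAP comes second.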
Next, let's look at the code:
Load the modules:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import phate
import scprep
import sklearn.decomposition # PCA
import sklearn.manifold # t-SNE
import umap
Reading the data, quality control and so on are not covered here; we look only at PHATE dimensionality reduction and visualization:
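# EBT_counts: the preprocessed cells-by-genes matrix from the loading/QC steps omitted above
phate_operator = phate.PHATE(n_jobs=-2)  # create the operator; n_jobs=-2 uses all but one CPU core
phate_operator.set_params(knn=4, decay=15, t=12)
Y_phate = phate_operator.fit_transform(EBT_counts)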
Here, pay attention to the parameters:
knn : Number of nearest neighbors (default: 5). Increase this (e.g. to 20) if your PHATE embedding appears very disconnected. You should also consider increasing knn if your dataset is extremely large (e.g. >100k cells)
decay : Alpha decay (default: 15). Decreasing decay increases connectivity on the graph, increasing decay decreases connectivity. This rarely needs to be tuned. Set it to None for a k-nearest neighbors kernel.
t : Number of times to power the operator (default: 'auto'). This is equivalent to the amount of smoothing done to the data. It is chosen automatically by default, but you can increase it if your embedding lacks structure, or decrease it if the structure looks too compact.
gamma : Informational distance constant (default: 1). gamma=1 gives the PHATE log potential, but other informational distances can be interesting. If most of the points seem concentrated in one section of the plot, you can try gamma=0.
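To get a feel for what `knn`, `decay` and `gamma` do, here is a small NumPy sketch. It is not the package's code: `alpha_decay_weight` and `potential` are hypothetical helpers, the kernel follows the alpha-decay form exp(-(d/eps)^decay) where eps is the distance to the knn-th neighbour, and the gamma family shown (log for gamma=1, a power transform otherwise) is written from my recollection of the package, so treat its exact form as an assumption.

```python
import numpy as np


def alpha_decay_weight(dist, eps, decay):
    """Kernel weight for a pair at distance `dist`; `eps` is the knn-th neighbour distance."""
    return np.exp(-(dist / eps) ** decay)


def potential(p, gamma):
    """Transform diffusion probabilities; gamma=1 gives the log potential."""
    if gamma == 1:
        return -np.log(p + 1e-7)
    c = (1 - gamma) / 2
    return p ** c / c


# decay: a pair 1.5x farther apart than the knn-th neighbour distance.
w_soft = alpha_decay_weight(1.5, eps=1.0, decay=2)
w_sharp = alpha_decay_weight(1.5, eps=1.0, decay=15)
print("weight with decay=2 :", w_soft)
print("weight with decay=15:", w_sharp)

# knn: a larger knn means a larger eps, so the same pair becomes connected again.
w_more_neighbours = alpha_decay_weight(1.5, eps=2.0, decay=15)
print("weight with larger eps:", w_more_neighbours)

# gamma: how strongly small transition probabilities are spread apart.
gap_log = abs(potential(1e-4, gamma=1) - potential(1e-2, gamma=1))
gap_sqrt = abs(potential(1e-4, gamma=0) - potential(1e-2, gamma=0))
print("gap between p=1e-4 and p=1e-2, gamma=1:", gap_log)
print("gap between p=1e-4 and p=1e-2, gamma=0:", gap_sqrt)
```

A higher decay cuts off pairs beyond the neighbourhood radius much more sharply (less connectivity), while a larger `knn` widens that radius. With `gamma=1` differences among small probabilities are stretched far more than with `gamma=0`, which may explain why trying `gamma=0` is suggested when points bunch up in one part of the plot.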
If, as the paper claims, PHATE really can learn and maintain local and global distances in low-dimensional space, then its visualization is better than UMAP's and is the most suitable choice.