PHATE for dimensionality reduction of 10X single-cell data

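There are many dimensionality-reduction methods for single-cell data today (PCA, t-SNE, UMAP). You don't need to try them one by one. Master a few of the main analysis tools, understand their principles and code in depth, and combine their complementary strengths to reach your analysis goal.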

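Today I'm sharing one such method. The paper is "Visualizing structure and transitions in high-dimensional biological data"; the journal's impact factor is above 36, which is quite high. Our task today is to work through the paper and share some code. Please study it carefully and grasp the essentials rather than simply copying code.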

The paper:

1. Abstract:

The high-dimensional data created by high-throughput technologies require visualization tools that reveal data structure and patterns in an intuitive form. We present PHATE, a visualization method that captures both local and global nonlinear structure using an information-geometric distance between data points. We compare PHATE to other tools on a variety of artificial and biological datasets, and find that it consistently preserves a range of patterns in data, including continual progressions, branches and clusters, better than other tools. We define a manifold preservation metric, which we call denoised embedding manifold preservation (DEMaP), and show that PHATE produces lower-dimensional embeddings that are quantitatively better denoised as compared to existing visualization methods. An analysis of a newly generated single-cell RNA sequencing dataset on human germ-layer differentiation demonstrates how PHATE reveals unique biological insight into the main developmental branches, including identification of three previously undescribed subpopulations. We also show that PHATE is applicable to a wide variety of data types, including mass cytometry, single-cell RNA sequencing, Hi-C and gut microbiome data. (Nothing much to note in this part; the authors are promoting their own software.)

2. Introduction

Single-cell data really does need very good visualization tools. Existing ones include principal component analysis (PCA), t-distributed stochastic neighbor embedding (t-SNE) and Uniform Manifold Approximation and Projection (UMAP); UMAP is probably the one most people use now. However, these methods are suboptimal for exploring high-dimensional biological data. The reasons:
(1) Such methods tend to be sensitive to noise. (I don't know whether you have looked into denoising of single-cell data and droplet analysis.) Methods like PCA and Isomap fail to explicitly remove this noise for visualization, rendering fine-grained local structure impossible to recognize. (Take note of this: PCA really does have this problem.)
(2) Nonlinear visualization methods such as t-SNE often scramble the global structure in data. (The global structure is not accurate enough, which is why UMAP is used more now.)
(3) Many dimensionality-reduction methods (for example, PCA and diffusion maps) fail to optimize for two-dimensional (2D) visualization as they are not specifically designed for visualization. (Those of you who have taken my course will find this familiar!)
(4) Common implementations of dimensionality reduction methods often lack computational scalability (they scale poorly). State-of-the-art methods such as multidimensional scaling (MDS) and t-SNE were originally presented as proofs-of-concept with somewhat naive implementations, which do not scale well to datasets with hundreds of thousands, let alone millions, of data points owing to speed or memory constraints. (I don't know whether you have looked into this. Once again: don't just copy code, think critically.)
(5) Some methods try to alleviate visualization challenges by directly imposing a fixed geometry or intrinsic structure on the data. However, methods that impose a structure on the data generally have no way of alerting the user whether the structural assumption is correct. (Many newer tools have since fixed this.) The authors give an example: any data will be transformed to fit a tree with Monocle2 or clusters with t-SNE. While such methods are useful for data that fit their prior assumptions, they can generate misleading results otherwise, and are often ill suited for hypothesis generation or data exploration. (This should feel familiar: now you see why clustering and Monocle2 results are so often unsatisfying!)
Next come the advantages of the PHATE software, which we will only skim: PHATE provides an accurate, denoised representation of both local and global structure of a dataset in the required number of dimensions without imposing any strong assumptions on the structure of the data, and is highly scalable both in memory and runtime.

[Figure]

3. Results

Let's first go over some background:
(1) t-SNE focuses on preserving local structure, often at the expense of the global structure.
(2) PCA focuses on preserving global structure at the expense of the local structure.
(3) Although PCA is often used for denoising as a preprocessing step, both PCA and t-SNE provide noisy visualizations when the data is noisy, which can obscure the structure of the data. (Make sure you understand this point; otherwise you will finish an analysis without knowing whether it is right or wrong.)
(4) By contrast, diffusion maps effectively denoise data and learn the local and global structure. However, diffusion maps typically encode this information in higher dimensions, which are not amenable to visualization, and can introduce distortions in the visualization under certain conditions. (Diffusion maps were covered in an earlier lesson.)


[Figure]

Here is the key point: PHATE is designed to overcome these weaknesses and provide a visualization that preserves the local and global structure of the data, denoises the data and presents as much information as possible into low dimensions.


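[Figure]

Let's look at the main steps:

(1) Encode local data information via local similarities (the local structure). The distance used here is still the Euclidean distance. (I covered how distances are defined in R in class; these are basics you need to know.)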


[Figure]

(2) Encode global relationships in data using the potential distance. This step relies on the same diffusion process that underlies diffusion maps, which I also covered in class.
(3) Embed potential distance information into low dimensions for visualization. This ensures that all variability is squeezed into two dimensions for a maximally informative embedding.
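To make the three steps concrete, here is a minimal NumPy sketch. It is not the phate library's implementation: the function name `phate_sketch` and the toy data are hypothetical, the alpha-decay kernel follows my reading of the paper, and classical MDS stands in for the MDS step of the real software.

```python
import numpy as np

def phate_sketch(X, knn=5, decay=15, t=12, n_components=2):
    """Toy version of the three PHATE steps (illustration only)."""
    # Step 1: local similarities from Euclidean distances (alpha-decay kernel).
    diff = X[:, None, :] - X[None, :, :]
    D = np.sqrt((diff ** 2).sum(axis=-1))
    # Distance from each point to its knn-th neighbor (column 0 is the point itself).
    eps = np.sort(D, axis=1)[:, knn]
    K = 0.5 * np.exp(-(D / eps[:, None]) ** decay) \
        + 0.5 * np.exp(-(D / eps[None, :]) ** decay)

    # Step 2: row-normalize into a Markov matrix, diffuse for t steps,
    # then compare log-transformed transition probabilities (potential distance).
    P = K / K.sum(axis=1, keepdims=True)
    P_t = np.linalg.matrix_power(P, t)
    U = -np.log(P_t + 1e-7)
    diff_u = U[:, None, :] - U[None, :, :]
    potential_dist = np.sqrt((diff_u ** 2).sum(axis=-1))

    # Step 3: embed the potential distances in low dimensions.
    # Assumption: classical MDS is used here only to keep the sketch short.
    n = X.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (potential_dist ** 2) @ J
    eigvals, eigvecs = np.linalg.eigh(B)
    order = np.argsort(eigvals)[::-1][:n_components]
    return eigvecs[:, order] * np.sqrt(np.maximum(eigvals[order], 0.0))

# Hypothetical toy data: 60 "cells" along a noisy progression in 10 dimensions.
rng = np.random.default_rng(0)
progress = np.linspace(0, 1, 60)
X = np.outer(progress, rng.normal(size=10)) + 0.01 * rng.normal(size=(60, 10))
Y = phate_sketch(X, knn=5, decay=15, t=12)
print(Y.shape)
```

The real package adds things this sketch leaves out, such as automatic selection of t and approximations for large datasets, so use it only to follow the logic of the steps.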


[Figure]

Analysis strategies recommended by the paper

Here we present new methods that provide suggested end points, branch points and branches on the basis of the information from higher-dimensional PHATE embeddings. (This is an analysis of the data's structure; as you can see, the structure is much like the tree structure in Monocle2.)
(1) Branch-point identification with local intrinsic dimensionality. See the figure below for how branch points are defined: branch points often encapsulate switch-like decisions where cells sharply veer towards one of a small number of fates.


[Figure]

[Figure]

(2) End-point identification with diffusion extrema. (This software even identifies end points, which puts it on a par with URD.) We identify end points in the PHATE embedding as those that are least central and most distinct by computing the eigenvector centrality and the distinctness of a cellular state relative to the general data by considering the minima and maxima of diffusion eigenvectors as motivated by ref. This part is worth studying closely if you are interested: identifying branch points and end points, and placing cells onto the trajectory, demands a lot of prior knowledge, which of course also means higher accuracy. Let's look at the result of placing the cells:


[Figure]

This looks much like a force-directed layout.
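The "diffusion extrema" idea behind end-point identification can be sketched in a few lines. This is only an illustration under my own assumptions: a plain Gaussian kernel replaces PHATE's kernel, the eigenvector-centrality part is left out, and `diffusion_extrema` is a hypothetical helper, not a function of the phate package.

```python
import numpy as np

def diffusion_extrema(X, sigma=1.0, n_vectors=1):
    """Indices where the leading nontrivial diffusion eigenvectors are minimal or maximal."""
    diff = X[:, None, :] - X[None, :, :]
    D2 = (diff ** 2).sum(axis=-1)
    K = np.exp(-D2 / (2 * sigma ** 2))  # Gaussian kernel (assumption)
    d = K.sum(axis=1)
    # Symmetric matrix with the same eigenvalues as the Markov matrix P = D^-1 K.
    A = K / np.sqrt(np.outer(d, d))
    eigvals, eigvecs = np.linalg.eigh(A)
    order = np.argsort(eigvals)[::-1]
    # Convert to right eigenvectors of P; the first one is constant and is skipped.
    psi = eigvecs[:, order] / np.sqrt(d)[:, None]
    ends = set()
    for j in range(1, n_vectors + 1):
        ends.add(int(np.argmin(psi[:, j])))
        ends.add(int(np.argmax(psi[:, j])))
    return sorted(ends)

# Hypothetical toy trajectory: 30 cells ordered along a straight line.
X = np.linspace(0, 10, 30)[:, None]
end_points = diffusion_extrema(X)
print(end_points)
```

On this linear toy trajectory the two extrema are the first and the last cell, which is what we would call the end points. On branched data you would look at more eigenvectors (`n_vectors` > 1).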

Comparison between tools

A quick look at this part is enough.


[Figure]

Looking at the results, PHATE comes out as the most accurate. In my view this is to be expected in theory, since PHATE asks for more manual supervision. PHATE had the highest DEMaP score in 22 of 24 comparisons and was the top-performing method overall. Uniform manifold approximation and projection (UMAP) was the second best performing method overall but had the highest DEMaP score in only two of the comparisons, one of which is equal with PHATE. (These are the cases where UMAP has the advantage.)
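As I understand the paper, DEMaP is a Spearman correlation between ground-truth (geodesic) distances on the noise-free data and Euclidean distances in the embedding. Below is a simplified sketch of that idea: the reference distances are simply given rather than computed as geodesics, and `demap_like` is a hypothetical name, not the authors' code.

```python
import numpy as np

def pairwise_dist(Y):
    diff = Y[:, None, :] - Y[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def rank(v):
    """Ranks 0..n-1; ties are broken by position (fine for continuous distances)."""
    r = np.empty(len(v))
    r[np.argsort(v, kind="stable")] = np.arange(len(v))
    return r

def demap_like(reference_dist, embedding):
    """Spearman correlation between reference distances and embedding distances."""
    iu = np.triu_indices(reference_dist.shape[0], k=1)  # each pair once
    a = rank(reference_dist[iu])
    b = rank(pairwise_dist(embedding)[iu])
    return float(np.corrcoef(a, b)[0, 1])

# Hypothetical toy data: 40 cells along a 1-D progression.
rng = np.random.default_rng(0)
t = np.sort(rng.uniform(0, 1, size=40))
reference = np.abs(t[:, None] - t[None, :])      # ground-truth distances
good = np.column_stack([3 * t, np.zeros(40)])    # embedding that keeps the ordering
bad = rng.permutation(good)                      # same points, cell identities shuffled
score_good = demap_like(reference, good)
score_bad = demap_like(reference, bad)
print(round(score_good, 3), round(score_bad, 3))
```

An embedding that preserves the ordering of distances scores close to 1, while the shuffled one scores much lower. That is all a "higher DEMaP score" means in the comparison above.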

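Visual comparison of embeddings from different methods
[Figure]

PHATE provides a clean and relatively denoised visualization of the data that highlights both the local and global structure. There are of course more data-analysis results after this, but they follow the usual pattern and a quick read is enough.

To sum up in one sentence: the problem PHATE addresses is making the low-dimensional visualization correspond to the intrinsic relationships among the cells. By the paper's benchmark, PHATE does this best, with UMAP second.

Next, let's look at the code:

Load the modules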

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import phate
import scprep
import sklearn.decomposition # PCA
import sklearn.manifold # t-SNE
import umap

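Reading the data, quality control and so on are not covered here; we only look at PHATE dimensionality reduction and visualization.

# Assumption: EBT_counts is the filtered, normalized expression matrix from the skipped loading/QC steps.
phate_operator = phate.PHATE()  # create the operator before setting its parameters
phate_operator.set_params(knn=4, decay=15, t=12)
Y_phate = phate_operator.fit_transform(EBT_counts)
Let's take a closer look at the parameters here: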
    knn : Number of nearest neighbors (default: 5). Increase this (e.g. to 20) if your PHATE embedding appears very disconnected. You should also consider increasing knn if your dataset is extremely large (e.g. >100k cells)
    decay : Alpha decay (default: 15). Decreasing decay increases connectivity on the graph, increasing decay decreases connectivity. This rarely needs to be tuned. Set it to None for a k-nearest neighbors kernel.
    t : Number of times to power the operator (default: 'auto'). This is equivalent to the amount of smoothing done to the data. It is chosen automatically by default, but you can increase it if your embedding lacks structure, or decrease it if the structure looks too compact.
    gamma : Informational distance constant (default: 1). gamma=1 gives the PHATE log potential, but other informational distances can be interesting. If most of the points seem concentrated in one section of the plot, you can try gamma=0.
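To see why a larger knn helps when the embedding "appears very disconnected", here is a small sketch that counts the connected components of a k-nearest-neighbor graph. It only illustrates the idea and is not how the phate package builds its graph; `knn_components` and the toy data are hypothetical.

```python
import numpy as np
from collections import deque

def knn_components(X, knn):
    """Count connected components of the symmetrized k-nearest-neighbor graph."""
    diff = X[:, None, :] - X[None, :, :]
    D = np.sqrt((diff ** 2).sum(axis=-1))
    neighbors = np.argsort(D, axis=1)[:, 1:knn + 1]  # column 0 is the point itself
    n = X.shape[0]
    adj = [set() for _ in range(n)]
    for i in range(n):
        for j in neighbors[i]:
            adj[i].add(int(j))
            adj[int(j)].add(i)
    seen, components = set(), 0
    for start in range(n):
        if start in seen:
            continue
        components += 1
        queue = deque([start])
        seen.add(start)
        while queue:  # breadth-first search over one component
            u = queue.popleft()
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    queue.append(v)
    return components

# Hypothetical toy data: two well-separated groups of 10 cells each.
group_a = np.arange(10) * 0.1
group_b = 5.0 + np.arange(10) * 0.1
X = np.concatenate([group_a, group_b])[:, None]
small_k = knn_components(X, knn=4)
large_k = knn_components(X, knn=12)
print(small_k, large_k)
```

With knn=4 every cell only finds neighbors inside its own group, so the graph falls into two pieces. With knn=12 each cell must reach into the other group, and the graph becomes a single connected component.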

If, as the paper says, PHATE can learn and maintain local and global distances in low dimensional space, then its visualization should be preferable to UMAP's and is the most suitable choice.


[Figure]