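Hello, it's Wednesday, the golden stretch of the week. Time moves fast and deserves to be cherished. There is a line in Kung Fu Panda: quit, don't quit; noodles, don't noodles. You are too concerned with what was and what will be. There is a saying: yesterday is history, tomorrow is a mystery, but today is a gift. Cherish today the way you would cherish a gift.
Today I'm sharing a method that learns and visualizes abstract cellular features and groupings of data at all levels of granularity: Multiscale PHATE. The paper is "Multiscale PHATE identifies multimodal signatures of COVID-19", published in Nature Biotechnology (IF 54). It is a very good method and I recommend it. You can also refer to my earlier post, "10X single-cell dimensionality reduction analysis: PHATE".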
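Background
- Current tools for dimensionality reduction and data exploration, including t-distributed stochastic neighbor embedding (t-SNE), uniform manifold approximation and projection (UMAP) and principal component analysis (PCA), show only a single level of granularity of the data. (Of course, they still have shortcomings in practice.)
- Multiscale PHATE is a method that learns and visualizes abstract cellular features and groupings of data at all levels of granularity. The algorithm is based on a dynamic topological process called diffusion condensation, which slowly condenses data points toward local centers of gravity to form natural, data-driven groupings across granularities. By letting cells cluster together naturally over successive condensation steps, this coarse-graining process continuously learns the topology of the underlying dataset, allowing exploration of a more continuous range of granularities that other methods cannot reveal.
I have shared the pros and cons of the various dimensionality reduction methods before; the slides are below for you to look through.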
Multiscale PHATE algorithm
Multiscale PHATE combines a data coarse-graining method called diffusion condensation with a manifold-preserving dimensionality reduction method called PHATE to produce multi-granular visualizations and clusters of high-dimensional biological data. The Multiscale PHATE algorithm can be broken down into four steps:
- compute a manifold-intrinsic, diffusion potential representation that learns the nonlinear biological manifold, as done in PHATE (see my earlier post "10X single-cell dimensionality reduction analysis: PHATE")
- coarse grain this diffusion potential using a fast diffusion condensation process (that is, reduce the resolution)
- select meaningful resolutions for downstream analysis with a gradient-based approach (after iterating to a certain point, fine-grained cells are merged into a single unit)
- visualize condensed diffusion potential coordinates at selected scales via metric multidimensional scaling (MMDS) and analyze coarser-grain resolutions to obtain multiscale clusters.
Multiscale PHATE first creates a diffusion potential representation U of the original data:
(1) First, a distance matrix is calculated between all cells based on their ambient measurements. The distance matrix is converted into an affinity matrix using an adaptive-bandwidth Gaussian kernel function, so that similarity between two cells decreases exponentially with their distance. (Feels familiar, doesn't it? It is similar to KNN.)
(2) Next, the affinity matrix is row-normalized to obtain the diffusion operator P, representing the probability distribution of transitioning from one cell to another in a single step. This diffusion operator is raised to the power t_D, the PHATE optimal diffusion timescale as computed by von Neumann entropy, to simulate a t_D-step random walk over the data graph. (The algorithm is still a bit hard to follow.)
(3) Finally, by taking the logarithm of P^(t_D), we calculate the diffusion potential of the data.
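Steps (1) to (3) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the PHATE implementation: the bandwidth is taken as each cell's distance to its `knn`-th neighbor, the diffusion time `t` is fixed by hand instead of being chosen via von Neumann entropy, and the negative log is used as the sign convention. The function name and parameters are hypothetical.

```python
import numpy as np

def diffusion_potential(X, knn=5, t=10, eps=1e-7):
    """Toy diffusion potential: distances -> adaptive kernel -> Markov matrix -> log."""
    # (1) Pairwise Euclidean distances between all cells (rows of X).
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))

    # Adaptive bandwidth (assumption): distance to the knn-th nearest neighbor.
    # Column 0 of the sorted distances is the cell itself (distance 0).
    bandwidth = np.sort(dist, axis=1)[:, knn]
    affinity = np.exp(-((dist / bandwidth[:, None]) ** 2))
    affinity = (affinity + affinity.T) / 2  # symmetrize the kernel

    # (2) Row-normalize to get the diffusion operator P, then walk t steps.
    P = affinity / affinity.sum(axis=1, keepdims=True)
    P_t = np.linalg.matrix_power(P, t)

    # (3) Log transform; eps avoids log(0). The minus sign is an assumed convention.
    return -np.log(P_t + eps)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))   # 50 toy "cells" with 3 features
U = diffusion_potential(X)
print(U.shape)
```

Each row of `U` is one cell's coordinates in diffusion potential space, which is the representation the condensation step operates on.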
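Previous work has shown that this internal representation computed in PHATE effectively learns the nonlinear geometry of complex biological datasets and can be quickly visualized in two or three dimensions with MMDS. (PHATE really does have its strengths for dimensionality reduction.) Multiscale PHATE uses this diffusion potential representation as the substrate for the diffusion condensation process. As in the diffusion potential calculation, diffusion condensation computes a diffusion operator P_t at each iteration, using a fixed-bandwidth Gaussian kernel function on the positions of the cells in diffusion potential space. The fixed bandwidth sets how local the computed cell-cell affinities are. The diffusion operator P_t is applied to the diffusion potential U_t and acts as a diffusion filter, effectively replacing the coordinates of each point with the weighted average of its diffusion neighbors. When the distance between two cells falls below a distance threshold, the cells are merged, indicating that they belong to the same cluster. This process is then repeated iteratively until all cells have collapsed into a single cluster.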
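The condensation loop described above can be written as a toy NumPy routine. This is a sketch under assumptions, not the package's code: the greedy merging rule, the parameter names (`sigma`, `merge_threshold`, `max_iter`) and the unweighted treatment of merged points are all illustrative choices.

```python
import numpy as np

def merge_close_points(points, labels, threshold):
    """Greedily merge points closer than `threshold` and relabel the cells."""
    reps = []                                  # indices kept as representatives
    assign = np.empty(len(points), dtype=int)  # old point index -> new point index
    for i, p in enumerate(points):
        for j, r in enumerate(reps):
            if np.linalg.norm(p - points[r]) < threshold:
                assign[i] = j
                break
        else:
            assign[i] = len(reps)
            reps.append(i)
    merged = np.array([points[assign == j].mean(axis=0) for j in range(len(reps))])
    return merged, assign[labels]

def diffusion_condensation(U, sigma=1.0, merge_threshold=1e-3, max_iter=500):
    """Return the per-cell cluster labels recorded at every iteration."""
    points = np.asarray(U, dtype=float).copy()
    labels = np.arange(len(points))            # every cell starts as its own cluster
    history = [labels.copy()]
    for _ in range(max_iter):
        if len(points) == 1:
            break
        # Fixed-bandwidth Gaussian kernel on the current positions.
        sq_dist = ((points[:, None, :] - points[None, :, :]) ** 2).sum(axis=-1)
        kernel = np.exp(-sq_dist / sigma ** 2)
        P = kernel / kernel.sum(axis=1, keepdims=True)
        # Diffusion filter: move each point to the weighted mean of its neighbors.
        points = P @ points
        # Merge points that have come within the distance threshold.
        points, labels = merge_close_points(points, labels, merge_threshold)
        history.append(labels.copy())
    return history

rng = np.random.default_rng(0)
toy_U = rng.normal(scale=0.1, size=(20, 2))
history = diffusion_condensation(toy_U)
print([len(np.unique(h)) for h in history])
```

The printed list shows the number of clusters per iteration shrinking until everything has merged; each entry of `history` is one level of the condensation hierarchy.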
By denoising the diffusion potential, Multiscale PHATE addresses two shortcomings of the original diffusion condensation.
- Diffusion condensation in its original form is not effective at learning or visualizing the nonlinear geometry of biological datasets and is prone to condensing points off the data manifold.
- By first learning the nonlinear data manifold through the diffusion potential calculation and feeding it into diffusion condensation, Multiscale PHATE not only learns the nonlinear geometry of complex datasets effectively but also quickly visualizes and learns clusters at the resolutions of interest.
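To identify meaningful scales, a gradient-based approach is applied to determine the stable resolutions of the condensation process for downstream analysis. These resolutions are visualized by computing a potential distance matrix D_Ut from the distances between pairs of rows in U_t (that is, the distances between cells). Finally, the Multiscale PHATE visualization is obtained by running MMDS, which preserves the distances in D_Ut in two or three dimensions, ready for plotting. Thus, Multiscale PHATE can not only compute a coherent data topology along the data manifold but also quickly visualize intermediate layers of the condensation process. Using a stochastic block model with known clusters, the authors show that, as more and more noise is added to the model, diffusion condensation initialized with the diffusion potential outperforms diffusion condensation on the ambient measurement space.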
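To make the "distance matrix, then MDS" step concrete, here is a small NumPy sketch. Note the substitution: metric MDS is normally solved iteratively, so this example swaps in classical (Torgerson) MDS, which has a closed form and is enough to show how a low-dimensional layout is recovered from pairwise distances alone. The function name and the toy `U_t` are illustrative assumptions.

```python
import numpy as np

def classical_mds(dist, n_components=2):
    """Classical (Torgerson) MDS: embed points from a pairwise distance matrix."""
    n = dist.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    B = -0.5 * J @ (dist ** 2) @ J             # double-centered Gram matrix
    eigvals, eigvecs = np.linalg.eigh(B)       # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    top_vals = np.clip(eigvals[order], 0.0, None)
    return eigvecs[:, order] * np.sqrt(top_vals)

rng = np.random.default_rng(0)
U_t = rng.normal(size=(30, 5))                 # stand-in for condensed potential coordinates
D = np.linalg.norm(U_t[:, None, :] - U_t[None, :, :], axis=-1)  # distances between rows
embedding = classical_mds(D, n_components=2)
print(embedding.shape)
```

With a real workflow one would use a metric MDS solver instead; the point here is only that the embedding is built from the row-to-row distances of U_t.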
For more details on the generalizability, scalability and reproducibility of Multiscale PHATE, see the figure below.
Comparison between methods
- Metrics: adjusted Rand index (ARI) and F1 scores (for ARI, see the post "Computing the adjusted Rand index (ARI)"; for F1 scores, see the post "F1-score in machine learning")
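For readers who have not met ARI before, here is a small pure-Python sketch of the standard pair-counting formula. The function name is my own; in practice a library routine would normally be used.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index from the contingency counts of two labelings."""
    n = len(labels_true)
    pair_counts = Counter(zip(labels_true, labels_pred))
    true_counts = Counter(labels_true)
    pred_counts = Counter(labels_pred)

    # Number of cell pairs placed together in both / in each labeling.
    sum_pairs = sum(comb(c, 2) for c in pair_counts.values())
    sum_true = sum(comb(c, 2) for c in true_counts.values())
    sum_pred = sum(comb(c, 2) for c in pred_counts.values())

    expected = sum_true * sum_pred / comb(n, 2)   # agreement expected by chance
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:                     # degenerate case, e.g. one cluster each
        return 1.0
    return (sum_pairs - expected) / (max_index - expected)

# Same grouping with swapped label names: a perfect score.
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
```

ARI is 1 for identical groupings regardless of label names and is close to 0 (or negative) when agreement is no better than chance, which makes it suitable for comparing clusterings against known groupings.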
Advantage 1 of Multiscale PHATE: preserved local and global distances (for single-cell data, this looks a lot like a feature of UMAP).
Across almost the entire range of biological noise, Multiscale PHATE outperformed the other methods. In particular, Multiscale PHATE has a clear advantage when visualizing highly noisy data. Across the paper's comparisons, Multiscale PHATE performed as well as or better than other visualization modalities, especially as noise increased within the dataset.
Multiscale clusters accurately captured established groupings of data.
(1) Noisy synthetic data and two- and three-layer hierarchical stochastic block models
Of course, the paper also presents practical applications of the method, notably on SARS-CoV-2 single-cell data, where it achieved good results; I won't go into those here.
Discussion (this algorithm really is quite hard)
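Here, a multiscale data exploration technique is presented for visualizing, clustering and comparing large-scale datasets, filling a key gap in biological data exploration. Multiscale PHATE discovers groupings of data at different scales that are predictive of clinical outcome. Biological data naturally contains structure at multiple granularities. However, most analysis methods, whether clustering or dimensionality reduction algorithms, usually focus on a single level of resolution and offer no systematic way to explore different scales. Hierarchical clustering is one approach that can provide some levels of resolution. However, because of the continual merging that occurs in hierarchical clustering methods (for example, Louvain), many levels of resolution are missed and biologically relevant levels of granularity are not recapitulated. In contrast, Multiscale PHATE provides a fast, manifold-learning-based technique for revealing a continuum of resolutions of structure and features by learning the data topology.
The analysis shows that Multiscale PHATE can be combined with other techniques, such as MELD and mutual information (DREMI), to provide deep and detailed insight into biological processes. Together with Multiscale PHATE, these tools allow users to find resolutions that naturally capture significant differences between patients, separate pathogenic and protective cell subpopulations across scales, and identify key markers associated with disease.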
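Methods (painful math)
Finally, let's look at the example code, linked from Multiscale PHATE.
The algorithm combines the dimensionality reduction technique PHATE with the multi-granular analysis tool diffusion condensation. First, a nonlinear diffusion manifold is computed with PHATE. Diffusion condensation then uses this manifold-intrinsic diffusion space to slowly condense data points toward local centers of gravity, forming natural, data-driven groupings across multiple granularities. These granularities can then be inspected.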
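Using gradient analysis, which tracks how the data density changes over successive iterations of the diffusion condensation process, we can identify stable resolutions of the hierarchical tree for downstream analysis. With this stability information, we can cut the hierarchical tree at multiple resolutions to produce visualizations and clusters across granularities for downstream analysis.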
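One plausible reading of "gradient analysis" can be illustrated as follows. This is a sketch under my own assumptions, not the package's procedure: it summarizes each iteration by the number of remaining clusters and calls an iteration "stable" when the absolute gradient of that curve is at a local minimum. The function name and the toy counts are hypothetical.

```python
import numpy as np

def stable_levels(n_clusters_per_iter):
    """Return iterations where the cluster count changes least (local minima of |gradient|)."""
    counts = np.asarray(n_clusters_per_iter, dtype=float)
    grad = np.abs(np.gradient(counts))
    levels = [
        i for i in range(1, len(grad) - 1)
        if grad[i] <= grad[i - 1] and grad[i] <= grad[i + 1] and grad[i] < grad.max()
    ]
    return np.array(levels, dtype=int), grad

# Toy curve: the cluster count drops quickly, pauses, drops again, pauses again.
toy_counts = [100, 60, 40, 40, 40, 20, 10, 10, 10, 1]
levels, grad = stable_levels(toy_counts)
print(levels)
```

The plateaus in the toy curve are where the grouping stays unchanged for a while, which is the intuition behind cutting the tree at those iterations.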
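By identifying multiple resolutions, Multiscale PHATE lets users interact with their data and zoom in on cell subsets of interest to reveal increasingly fine-grained information about cell types and subtypes. When combined with other computational algorithms for high-dimensional data analysis, such as MELD and DREMI, Multiscale PHATE provides deep and detailed insight into biological processes.
Installation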
pip install --user git+https://github.com/KrishnaswamyLab/Multiscale_PHATE
Quick Start
import multiscale_phate
import scprep

# X is your cells-by-genes expression matrix (for example, the preprocessed data_sqrt built below)
mp_op = multiscale_phate.Multiscale_PHATE()
mp_embedding, mp_clusters, mp_sizes = mp_op.fit_transform(X)

# Plot optimal visualization
scprep.plot.scatter2d(mp_embedding, s=mp_sizes, c=mp_clusters,
                      fontsize=16, ticks=False, label_prefix="Multiscale PHATE", figsize=(16, 12))
Breaking it down
Loading packages
import multiscale_phate as mp
import numpy as np
import pandas as pd
import scprep
import os
Example data: 10X PBMC data
## Save data directory
data_dir = os.path.expanduser("~/multiscale_phate_data") # enter path to data directory here (this is where you want to save 10X data)
if not os.path.isdir(data_dir):
    os.mkdir(data_dir)
file_name = '10X_pbmc_data.h5'
file_path = os.path.join(data_dir, file_name)
URL = 'https://cf.10xgenomics.com/samples/cell-exp/2.1.0/pbmc4k/pbmc4k_raw_gene_bc_matrices_h5.h5'
scprep.io.download.download_url(URL, file_path)
data = scprep.io.load_10X_HDF5(file_path, gene_labels='both')
data.head()
Quality control and preprocessing
data = scprep.filter.filter_library_size(data, cutoff=1000, keep_cells='above')
data = scprep.filter.filter_rare_genes(data)
data_norm, libsize = scprep.normalize.library_size_normalize(data, return_library_size=True)
data_sqrt = np.sqrt(data_norm)
data_sqrt.head()
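To see what the normalization and square-root steps do to the numbers, here is a toy NumPy version on a tiny count matrix. It is an illustration only: the `rescale` target of 10,000 is an assumption about the default, and the real pipeline operates on the filtered 10X data above.

```python
import numpy as np

# Toy count matrix: 3 cells (rows) x 3 genes (columns).
counts = np.array([[4.0, 0.0, 6.0],
                   [1.0, 1.0, 2.0],
                   [0.0, 9.0, 1.0]])

libsize = counts.sum(axis=1)                 # total counts per cell
rescale = 10_000                             # assumed target library size
norm = counts / libsize[:, None] * rescale   # every cell now sums to `rescale`
sqrt_norm = np.sqrt(norm)                    # square root dampens highly expressed genes

print(norm.sum(axis=1))
print(sqrt_norm.round(1))
```

After normalization, cells with very different sequencing depth become comparable, and the square root keeps a handful of highly expressed genes from dominating the distances between cells.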
Creating multi-resolution embeddings and clusters with Multiscale PHATE
Computing Multiscale PHATE tree involves two successive steps:
- Building the Multiscale PHATE operator
- Fitting your data with the operator to construct a diffusion condensation tree and running gradient analysis to identify stable resolutions for downstream analysis
Here we set the random_state to enhance reproducibility.
mp_op = mp.Multiscale_PHATE(random_state=1)
levels = mp_op.fit(data_sqrt)
In order to identify salient levels of the diffusion condensation tree, we can visualize the output of our gradient analysis and highlight stable resolutions for downstream analysis:
import matplotlib.pyplot as plt
ax = plt.plot(mp_op.gradient)
ax = plt.scatter(levels, mp_op.gradient[levels], c = 'r', s=100)
Visualizing full Diffusion Condensation tree
Because diffusion condensation creates a hierarchy of cells and clusters, it is useful to visualize this tree and map iteration or cluster labels onto it. We can first build the tree with the build_tree() function:
### building tree
tree = mp_op.build_tree()
scprep.plot.scatter3d(tree, s= 50,
fontsize=16, ticks=False, figsize=(10,10))
It can also be useful to color the tree with various labels, such as diffusion condensation iteration and by a particular layer of the tree. Since the tree is effectively a series of stacked 2D condensed points, coloring the tree by the third column will color each point by its corresponding iteration:
scprep.plot.scatter3d(tree, c = tree[:,2], s= 50,
fontsize=16, ticks=False, figsize=(10,10))
In order to color the tree by clusters found at a particular granularity of the Diffusion Condensation tree, we simply pass a resolution identified by the gradient analysis to the .get_tree_clusters() function and color our tree embedding with the result. Play around with the clustering level you pass to the .get_tree_clusters() function and see what happens:
tree_clusters = mp_op.get_tree_clusters(levels[9])
scprep.plot.scatter3d(tree, c = tree_clusters, s= 50,
fontsize=16, ticks=False, figsize=(10,10))
Visualizing Coarse Granularity
Now we are ready to produce an initial coarse embedding of the dataset. When running the .transform() function, we select a coarse resolution for our clusters (levels[9], which is level 136 in this example) and a finer resolution for our embedding (levels[2], which is level 53 in this example). By modifying the resolutions passed to the .transform() function, we can change the granularity of the visualization and clusters, producing coarser or finer embeddings and groupings of the data. We recommend playing around with these resolutions.
coarse_embedding, coarse_clusters, coarse_sizes = mp_op.transform(visualization_level = levels[2],cluster_level = levels[9])
scprep.plot.scatter2d(coarse_embedding, s = 100*np.sqrt(coarse_sizes), c = coarse_clusters,
fontsize=16, ticks=False,label_prefix="Multiscale PHATE", figsize=(10,8))
Next, we can assign specific clusters to cell types by mapping and visualizing the expression of key marker genes for T cells (CD3E), B cells (CD19) and monocytes (CD14) on our coarse embedding. To run this mapping, we call the .get_expression() function, passing the full expression vector from single cells as well as the resolution of the visualization.
We would like to note that you can perform MELD (Burkhardt et al. 2021) at this resolution as well by running get_expression() on a binarized perturbation signal [0,1] that denotes the perturbation of origin for a given cell.
coarse_expression = pd.DataFrame()
coarse_expression['CD3E'] = mp_op.get_expression(data_sqrt['CD3E (ENSG00000198851)'].values,
visualization_level = levels[2])
coarse_expression['CD19'] = mp_op.get_expression(data_sqrt['CD19 (ENSG00000177455)'].values,
visualization_level = levels[2])
coarse_expression['CD14'] = mp_op.get_expression(data_sqrt['CD14 (ENSG00000170458)'].values,
visualization_level = levels[2])
fig, axes = plt.subplots(1,3, figsize=(14, 4))
genes = ['CD3E', 'CD19', 'CD14']
for i, ax in enumerate(axes.flatten()):
    scprep.plot.scatter2d(coarse_embedding, s=25*np.sqrt(coarse_sizes),
                          c=coarse_expression[genes[i]], legend_anchor=(1,1), ax=ax, title=genes[i],
                          xticks=False, yticks=False, label_prefix="PHATE", fontsize=16, cmap='RdBu_r')
fig.tight_layout()
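Conceptually, mapping a per-cell signal (a gene's expression, or a binarized perturbation label as in the MELD note above) onto condensed points can be thought of as a group-wise average. The sketch below shows that idea on toy data; whether get_expression() computes exactly this is an assumption on my part, and the helper name is hypothetical.

```python
import numpy as np

def average_by_cluster(signal, cluster_ids):
    """Mean of a per-cell signal within each condensed point (cluster)."""
    signal = np.asarray(signal, dtype=float)
    cluster_ids = np.asarray(cluster_ids)
    unique_ids, inverse = np.unique(cluster_ids, return_inverse=True)
    sums = np.bincount(inverse, weights=signal)   # summed signal per cluster
    sizes = np.bincount(inverse)                  # number of cells per cluster
    return unique_ids, sums / sizes

# Five toy cells: a binary condition label and the condensed point each cell belongs to.
condition = [0, 1, 1, 0, 1]
condensed_point = [0, 0, 1, 1, 1]
ids, means = average_by_cluster(condition, condensed_point)
print(ids, means)
```

For a binary perturbation label, the per-point mean reads as the fraction of cells in that point coming from the perturbed condition, which is what makes the coloring interpretable at coarse resolutions.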
Visualizing Fine Granularity
Next, multiscale PHATE allows users to 'zoom in' on populations of interest and perform finer grained analysis using the .transform() and .get_expression() functions.
Using the .transform() function this way can get a little confusing. Essentially, we select a coarse resolution of clusters (coarse_cluster_level) and then a cluster of interest to zoom in on at this resolution (coarse_cluster). Then, we can embed this population at a finer resolution (visualization_level as before) and a finer resolution of clusters (cluster_level). Again, please play around with each of these parameters to embed different clusters across granularities:
zoom_embedding, zoom_clusters, zoom_sizes = mp_op.transform(visualization_level=levels[1],
cluster_level=levels[2],
coarse_cluster_level=levels[9],
coarse_cluster=8)
scprep.plot.scatter2d(zoom_embedding, s = 500*np.sqrt(zoom_sizes), c = zoom_clusters,
fontsize=16, ticks=False,label_prefix="Multiscale PHATE", figsize=(10,8))
Next, we can identify the identities of subpopulations of interest by mapping the expression of known markers. This is done using the get_expression() function but, as with the .transform() function, we also have to pass coarse_cluster_level and coarse_cluster to indicate which population we intend to zoom in on.
In this case, we zoom into B cells and map the expression of key genes to identify B cell subpopulations: CD19 for naive B cells, CD20 (gene name MS4A1) for activated B cells and CD27 for memory B cells:
fine_expression = pd.DataFrame()
fine_expression['CD19'] = mp_op.get_expression(data_sqrt['CD19 (ENSG00000177455)'].values,
visualization_level = levels[1],
coarse_cluster_level=levels[9],
coarse_cluster=8)
fine_expression['CD27'] = mp_op.get_expression(data_sqrt['CD27 (ENSG00000139193)'].values,
visualization_level = levels[1],
coarse_cluster_level=levels[9],
coarse_cluster=8)
fine_expression['CD20'] = mp_op.get_expression(data_sqrt['MS4A1 (ENSG00000156738)'].values,
visualization_level = levels[1],
coarse_cluster_level=levels[9],
coarse_cluster=8)
fig, axes = plt.subplots(1,3, figsize=(14, 4))
genes = ['CD19','CD27','CD20']
for i, ax in enumerate(axes.flatten()):
    scprep.plot.scatter2d(zoom_embedding, s=50*np.sqrt(zoom_sizes),
                          c=fine_expression[genes[i]], legend_anchor=(1,1), ax=ax, title=genes[i],
                          xticks=False, yticks=False, label_prefix="PHATE", fontsize=16, cmap='RdBu_r')
fig.tight_layout()
Life is good, and even better with you.