hello 翘狱,童鞋們,今天我們來分享一個(gè)有關(guān)擬時(shí)分析的軟件-----VIA策菜,文章在VIA: Generalized and scalable trajectory inference in single-cell omics data,這個(gè)方法我認(rèn)為很經(jīng)典穿香,值得大家學(xué)習(xí)袜蚕,我們還是先分享文獻(xiàn),最后示例參考代碼亩钟。
Abstract
Inferring cellular trajectories using a variety of omic data is a critical task in single-cell data science.However, accurate prediction of cell fates, and thereby biologically meaningful discovery, is challenged by the sheer size of single-cell data, the diversity of omic data types, and t he complexity of their topologies(這個(gè)地方大家應(yīng)該深有感觸乓梨,我們通常用的擬時(shí)分析軟件都無法準(zhǔn)確的得到軌跡發(fā)生的真相,參數(shù)一調(diào)清酥,結(jié)果完全不一樣了). We present VIA, a scalable trajectory inference algorithm that overcomes these limitations by using lazy-teleporting random walks(這個(gè)方法也是第一次聽說) to accurately reconstruct complex cellular trajectories beyond tree-like pathways (e.g. cyclic or disconnected structures). We show that VIA robustly and efficiently unravels the fine-grained sub-trajectories in a 1.3-million-cell(通量確實(shí)很高啊) transcriptomic mouse atlas without losing the global connectivity at such a high cell count(能夠保留整體分布結(jié)構(gòu)). We further apply VIA to discovering elusive lineages and less populous cell fates missed by other methods across a variety of data types(少量稀有細(xì)胞的軌跡發(fā)生督禽,這個(gè)很重要), including single-cell proteomic, epigenomic, multi-omics datasets, and a new in-house single-cell morphological dataset。
Background
Single-cell omics data captures snapshots of cells t hat catalog cell types and molecular states with high precision.但是目前的軟件算法面臨4個(gè)挑戰(zhàn):
(1)it remains difficult to accurately reconstruct high-resolution cell trajectories and detect the pertinent cell fates and lineages without relying on prior knowledge of input parameter settings(先驗(yàn)知識(shí)和參數(shù)調(diào)整確實(shí)是目前擬時(shí)分析的痛點(diǎn)总处,一般做軌跡分析都需要先定義才可以狈惫,而且參數(shù)的影響非常大),This is a foundational but unmet attribute of trajectory inference (TI) that could make lineage prediction less biased towards input parameters, and thus minimize the confounding factors that impact the underlying hypothesis testing 鹦马,However, even the few algorithms which automate cell fate detection (e.g., SlingShot , Palantir and Monocle3) exhibit low sensitivity to cell fates and are highly susceptible to changes in input parameters(這個(gè)也是通病胧谈,整體結(jié)構(gòu)保持的前提下,稀有細(xì)胞類型就被“合并”了荸频,做過的都知道)菱肖。
(2)Second, current trajectory inference (TI) methods predominantly work well o n tree-like trajectories (e.g. Slingshot), but lack the generalisability to infer disconnected, cyclic or hybrid topologies without imposing restrictions on transitions and causality (這個(gè)也是痛點(diǎn),聽過我對(duì)于細(xì)胞軌跡分析課程的同學(xué)來說應(yīng)該都知道這一點(diǎn)旭从,可參考文章10X單細(xì)胞軌跡分析之回顧).This attribute is crucial in enabling unbiased discovery of complex trajectories which are commonly not well known a priori, especially given the increasing diversity of single-cell omic datasets.(說白了就是對(duì)于復(fù)雜軌跡的判斷仍然是一個(gè)大問題)稳强。
(3)Third, the growing scale of single-cell data,exceeds the existing T I capacity, not just in runtime and memory, but in preserving both the fine-grain resolution of the embedded trajectories and the global connectivity among them(這個(gè)也是問題,典型如monocle2和悦,過多的細(xì)胞無法運(yùn)算)退疫。such global information is lost in current TI methods after extensive dimension reduction or subsampling.(無法很好的運(yùn)算當(dāng)然就是結(jié)構(gòu)的丟失)。
(4)fueling the advance in single-cell technologies is the ongoing pursuit to understand cellular heterogeneity from a broader perspective beyond transcriptomics(技術(shù)上的需求)鸽素。數(shù)據(jù)的多樣化確實(shí)帶來了這樣的問題褒繁。However, the applicability of TI to a broader spectrum of single-cell data has yet to be fully exploited。
于是作者新開發(fā)了一個(gè)方法馍忽,VIA棒坏,reconstruct cell lineages based on lazy-teleporting random walks integrated with Markov chain Monte Carlo (MCMC) refinement燕差,當(dāng)然,后面說了很多軟件的有點(diǎn)坝冕。注意一下這里allows capture of cellular trajectories not only in multi-furcations and trees, but also in disconnected and cyclic topologies.可以說是巨大的進(jìn)步徒探。
有點(diǎn)除了上面這個(gè)以外,還有robust to a wide r ange of pre-processing and input algorithmic parameters喂窟,allow VIA to consistently identify pertinent lineages that remain elusive刹帕。后面有一些數(shù)據(jù)上的運(yùn)用。resilience in handling the wide disparity in single-cell data size structure a nd dimensionality a cross modalities.(通量很大谎替,這個(gè)很可以)偷溺,VIA is highly scalable with respect to number of cells (102 to 106 cells) and f eatures, without requiring extensive d imensionality reduction or subsampling which compromise global information.(這個(gè)通量真的是相當(dāng)可以),當(dāng)然還有一些新技術(shù)上的運(yùn)用钱贯。
Result
Algorithm
我們來看看算法過程:
(1)Step 1 : Single-cell l evel g raph i s c lustered s uch t hat e ach node represents a c luster of single cells ( computed by our clustering algorithm PARC) . The resulting cluster graph forms the basis for subsequent random walks(隨機(jī)游走這個(gè)概念不知道大家聽過多少挫掏,這里就是為了聚類)。
(2)Step 2 : 2-stage pseudotime computation: (i) The pseudotime ( relative to a user defined start cell) is first computed by the expected hitting time for a lazy-teleporting random walk a long an undirected graph. At each step, the walk ( with small probability) can remain ( orange arrows) or teleport ( red arrows) to any other state. (ii) Edges are then forward biased based on the expected hitting time ( See forward biased edges illustrated as the imbalance of double-arrowhead size). The pseudotime is further refined on the directed graph by running Markov chain Monte Carlo ( MCMC) simulations ( See 3 highlighted paths starting at root).(這個(gè)地方其實(shí)運(yùn)用到了一些算法秩命,比較復(fù)雜尉共,但是直觀來看更加科學(xué))。
(3)Step 3 : Consensus vote on terminal states based on vertex connectivity properties of the directed graph(準(zhǔn)確檢測(cè)末端細(xì)胞弃锐,這個(gè)大部分軟件都無法做到)袄友。
(4)Step 4 : lineage likelihoods computed as the visitation frequency under lazy-teleporting MCMC simulations.(構(gòu)建軌跡)。
(5)可視化和下游分析霹菊。
我們來詳細(xì)分析一下
VIA first represents t he single-cell data as a cluster graph(這個(gè)地方的聚類信息需要我們用其他軟件來做)剧蚣,computed by our recently developed data-driven community-detection algorithm,PARC, which allows scalable clustering whilst preserving global properties of the t opology needed for accurate TI(當(dāng)然不必非要用到這個(gè)軟件推薦的方法)。
The cell fates and their lineage pathways are then c omputed by a two-stage probabilistic method(概率方法),which is the key algorithmic contribution of this work In the first stage o f Step 2, VIA models the cellular process as a modified random walk that allows degrees of laziness (remaining at a node/state) and teleportation (jumping to a ny other node/state) with pre-defined probabilities.(這個(gè)地方其實(shí)會(huì)構(gòu)建更加復(fù)雜的軌跡結(jié)構(gòu)旋廷,有局部鸠按,有跳躍,有循環(huán))饶碘。The pseudotime, and thus the graph directionality, can be computed based on the theoretical hitting times of nodes目尖。(軌跡圖標(biāo)方向性的構(gòu)建)。The lazy-teleporting behavior prevents the expected hitting time from converging(匯聚) to a local distribution in the graph as otherwise occurs in regular random walks, especially when the sample size grows扎运。(這個(gè)措施其實(shí)還是很需要的)瑟曲。More specifically, the laziness and teleportation factors regulate the weights given to each eigenvector-value pair in the expected hitting time formulation(配方,公式豪治,表達(dá)方式) such that the stationary distribution (given by the local-node degree-properties in regular walks) does not overwhelm the global information provided by other ‘eigen-pairs’.(不知道大家懂這里在說什么么洞拨??鬼吵?需要一些數(shù)學(xué)的基礎(chǔ)知識(shí))扣甲。Moreover, the computation does not require subsetting the first k eigenvectors (bypassing the need for the user to select a suitable threshold or subset of eigenvectors) since the dimensionality is not on the order of number of cells, but is equal to the number of clusters(做過單細(xì)胞的同學(xué)這個(gè)地方應(yīng)該很熟悉吧)。Hence all eigenvalue-eigenvector pairs can be incorporated without causing a bottleneck in runtime.Consequently in VIA, the modified walk on a cluster-graph not only enables scalable pseudotime computation for large datasets in terms of runtime, but also preserves information about t he global neighborhood relationships within the graph.(看來能處理大規(guī)模的細(xì)胞量是有原因的齿椅,能夠保持全局結(jié)構(gòu)也是采用了特征向量)琉挖。In the second stage of Step 2, VIA infers the directionality of the graph by biasing the edge-weights with the initial pseudotime computations, and refines the pseudotime through lazy-teleporting MCMC simulations on the forward biased graph.(這是結(jié)果了)。
the MCMC-refined graph-edges of the lazy-teleporting r andom walk enable accurate predictions of terminal cell fates through a consensus vote of various vertex connectivity properties derived from the directed graph.(稀有細(xì)胞的檢測(cè))涣脚。The cell fate predictions obtained using this approach are more accurate and robust to changes in input d ata and parameters compared to other TI methods(穩(wěn)定性和準(zhǔn)確性更高)示辈。Trajectories towards identified terminal states are then r esolved using lazy-teleporting MCMC simulations。Together, these four steps facilitate holistic(整體的) topological visualization of TI on the single-cell level (e.g. using UMAP or PHATE(注意這里遣蚀,個(gè)人推薦使用PHATE矾麻,可以參考文章10X單細(xì)胞降維分析之PHATE))以及一些下游的分析。
理論部分全是精華芭梯,需要很深的基礎(chǔ)才可以理解险耀,大家繼續(xù)加油。
VIA accurately captures complex topologies obscured in other TI methods
我們來看看方法之間的比較
In a 4 -leaf multifurcation topology玖喘,當(dāng)然這個(gè)數(shù)據(jù)是模擬的(下圖)
VIA accurately captures the two cascading bifurcations which lead to 4 leaf nodes甩牺。
In particular, VIA detects the elusive ‘M2’ terminal s tate whereas other methods (Palantir, PAGA, Slingshot a nd Monocle3) merge it with the ‘M8’ lineage (末端的檢測(cè)更為準(zhǔn)確,但這里是作者自己測(cè)的累奈,可靠性性如何需要我們自己的檢驗(yàn))贬派。Monocle3 typically only captures a single bifurcation and thus merges the pairs of leaves that otherwise arise from t he second layer of bifurcation (monocle3這個(gè)軟件局限性特別大,不太建議大家使用)澎媒。
Even for the fairly simple cyclic topology搞乏,(下圖):
other methods tend to fragment the structure to varying degrees depending on the parameter choice whereas VIA consistently preserves the global cyclic structure. (其實(shí)環(huán)狀結(jié)構(gòu)的發(fā)育軌跡我們能遇到的非常少,能夠檢測(cè)的軟件就更少了)戒努。This is not to say VIA is invariant to parameter choice, but rather that VIA p redictably modulates the graph resolution across a wide range of K without disrupting the underlying global topology(see the increase in the number of nodes in K=30 versus K=5) (全局的判斷更為準(zhǔn)確)This characteristic is important for robustly analyzing multiple levels of resolution in complex graph topologies, as also shown in our later investigation of the 1.3-million-cell mouse atlas.(這個(gè)是作者的觀點(diǎn)请敦,個(gè)人非常贊同)。We quantify graph-edge accuracy in the cyclic and disconnected datasets by identifying false/true positive/negative edges relative to the reference truth in order to compute an F1-score The performance comparison for the disconnected hybrid topologies shows that VIA disentangles the cyclic and b ifurcating lineages and captures the key leaf-states in the bifurcation as well as the ‘tail’ extending from the cyclic topology Palantir overly fragments the two trajectories, whereas Monocle3 and Slingshot merge them.(其實(shí)這種情況才是我們通常會(huì)遇到的情況储玫,然而我們一般采用monocle2的方法把軌跡擬合到了一起冬三,得到的結(jié)果其實(shí)是錯(cuò)誤的)。
上面采用的數(shù)據(jù)是模擬數(shù)據(jù)缘缚,不能表示在真正運(yùn)用的過程中就會(huì)非常好勾笆。
VIA reveals rare lineages in epigenomic and transcriptomic l andscapes of human hematopoiesis.(第一個(gè)實(shí)例)。
檢測(cè)到了典型的分化結(jié)構(gòu)桥滨,The automated detection of these terminal states in VIA, as quantified by F1-scores on the annotated cells, remains robust to varying the number of neighbors in the KNN graph, and the number of principal components (PCs)(上面圖C)窝爪。Specifically, VIA’s sustained sensitivity to rarer cell types can be attributed t o a better underlying graph structure w here nodes are well delineated by PARC (as rare cell types are well separated by graph pruning in the clustering stage) and edges are not prematurely removed due to restrictions on causality(作者把軌跡結(jié)構(gòu)的準(zhǔn)確性歸功于PRPC的降維)。
后面和其它軟件的比較齐媒,不用想肯定VIA最好蒲每,不然發(fā)不出來??。
后面的實(shí)際運(yùn)用喻括,我們簡(jiǎn)單看看吧
VIA detects small endocrine Delta lineages and Beta subtypes
這部分的結(jié)果主要是對(duì)稀有細(xì)胞群檢測(cè)的準(zhǔn)確性邀杏。
To show that this is not a co-incidence of parameter choice, we verify that these populations can be identified for a wide range of chosen highly variable genes (HVGs) and number of PCs。
VIA recovers Isl1+ cardiac progenitor bifurcation in multi-omic data(多組學(xué)數(shù)據(jù),這里主要就是scRNA和scATAC)望蜡。
VIA preserves global connectivity when scaling to millions of cells(這部分講擴(kuò)展性示例)
當(dāng)然唤崭,軟件采用的主成分進(jìn)行推斷軌跡,這一點(diǎn)是應(yīng)該的脖律。
VIA’s lazy-teleporting MCMCs delineate mesoderm differentiation i n m ass cytometry data
結(jié)果依舊良好谢肾。
VIA captures morphological trends of live cells in cell cycle progression(細(xì)胞周期的分化)
這部分的個(gè)性化分析很高。
最后小泉,我們來看看代碼
import pyVIA.core as via
import numpy as np
import pandas as pd
import scanpy as sc
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
# 5780 cells x 14651 genes Human Replicate 1. Male african american, 38 years
ad = sc.read( '/home/shobi/Trajectory/Datasets/HumanCD34/human_cd34_bm_rep1.h5ad')
# we use annotations given by SingleR which uses Novershtern Hematopoietic cell data as the reference. These are in line
# with the annotations given by Setty et al., 2019 but offer a slightly more granular annotation.
nover_labels = pd.read_csv('/home/shobi/Trajectory/Datasets/HumanCD34/Nover_Cor_PredFine_notLogNorm.csv')['x'].values.tolist()
dict_abb = {'Basophils': 'BASO1', 'CD4+ Effector Memory': 'TCEL7', 'Colony Forming Unit-Granulocytes': 'GRAN1',
'Colony Forming Unit-Megakaryocytic': 'MEGA1', 'Colony Forming Unit-Monocytes': 'MONO1',
'Common myeloid progenitors': "CMP", 'Early B cells': "PRE_B2", 'Eosinophils': "EOS2",
'Erythroid_CD34- CD71+ GlyA-': "ERY2", 'Erythroid_CD34- CD71+ GlyA+': "ERY3",
'Erythroid_CD34+ CD71+ GlyA-': "ERY1", 'Erythroid_CD34- CD71lo GlyA+': 'ERY4',
'Granulocyte/monocyte progenitors': "GMP", 'Hematopoietic stem cells_CD133+ CD34dim': "HSC1",
'Hematopoietic stem cells_CD38- CD34+': "HSC2",
'Mature B cells class able to switch': "B_a2", 'Mature B cells class switched': "B_a4",
'Mature NK cells_CD56- CD16- CD3-': "Nka3", 'Monocytes': "MONO2",
'Megakaryocyte/erythroid progenitors': "MEP", 'Myeloid Dendritic Cells': 'mDC (cDC)', 'Na?ve B cells': "B_a1",
'Plasmacytoid Dendritic Cells': "pDC", 'Pro B cells': 'PRE_B3'}
#NOTE: Myeloid DCs are now called Conventional Dendritic Cells cDCs
nover_labels = [dict_abb[i] for i in nover_labels]
for i in list(set(nover_labels)):
print('the population of ', i, 'is ', nover_labels.count(i))
tsnem = ad.obsm['tsne']
adata_counts = sc.AnnData(ad.X) # ad.X is filtered, lognormalized,scaled// ad.raw.X is the filtered but not pre-processed
adata_counts.obs_names = ad.obs_names
adata_counts.var_names = ad.var_names
sc.tl.pca(adata_counts, svd_solver='arpack', n_comps=200)
true_label = nover_labels
ncomps=100#80
knn=30
v0_random_seed=4
v0 = via.VIA(adata_counts.obsm['X_pca'][:, 0:ncomps], true_label, jac_std_global=0.15, dist_std_local=1, knn=knn,
too_big_factor=0.3,
root_user=4823, dataset='humanCD34', preserve_disconnected=True, random_seed=v0_random_seed,
do_magic_bool=True, is_coarse=True,pseudotime_threshold_TS=20, neighboring_terminal_states_threshold=3) # *.4 root=1,
v0.run_VIA()
super_labels = v0.labels
df_ = pd.DataFrame(ad.X)
df_.columns = [i for i in ad.var_names]
print('start magic')
gene_list_magic = ['IL3RA', 'IRF8', 'GATA1', 'GATA2', 'ITGA2B', 'MPO', 'CD79B', 'SPI1', 'CD34', 'CSF1R', 'ITGAX']
df_magic = v0.do_magic(df_, magic_steps=3, gene_list=gene_list_magic)
df_magic_cluster = df_magic.copy()
df_magic_cluster['parc'] = v0.labels
df_magic_cluster = df_magic_cluster.groupby('parc', as_index=True).mean()
print('end magic', df_magic.shape)
f, ((ax, ax1)) = plt.subplots(1, 2, sharey=True, figsize = [20,10])
v0.draw_piechart_graph(ax, ax1, type_pt='gene', gene_exp=df_magic_cluster['GATA1'].values, title='GATA1')
tsi_list = via.get_loc_terminal_states(v0, adata_counts.obsm['X_pca'][:, 0:ncomps])
v1 = via.VIA(adata_counts.obsm['X_pca'][:, 0:ncomps], true_label, jac_std_global=0.15, dist_std_local=1, knn=knn,
too_big_factor=0.05, super_cluster_labels=super_labels, super_node_degree_list=v0.node_degree_list,
super_terminal_cells=tsi_list, root_user=4823,
x_lazy=0.95, alpha_teleport=0.99, dataset='humanCD34', preserve_disconnected=True,
super_terminal_clusters=v0.terminal_clusters, is_coarse=False, full_neighbor_array=v0.full_neighbor_array,
ig_full_graph=v0.ig_full_graph, full_distance_array=v0.full_distance_array,
csr_array_locally_pruned=v0.csr_array_locally_pruned,
random_seed=v0_random_seed,pseudotime_threshold_TS=20, neighboring_terminal_states_threshold=3) # *.4super_terminal_cells = tsi_list #3root=1,
v1.run_VIA()
#suppose you had a very large dataset, then you can use this to subsample before visualization
labels = v1.labels
idx = np.random.randint(len(labels), size=len(labels))
super_clus_ds_PCA_loc = via.sc_loc_ofsuperCluster_PCAspace(v0, v1, idx)
embedding = tsnem
via.draw_trajectory_gams(embedding[idx], super_clus_ds_PCA_loc, cluster_labels = np.asarray(labels)[idx], super_cluster_labels=np.asarray(v0.labels)[idx], super_edgelist= v0.edgelist_maxout,
x_lazy=v1.x_lazy, alpha_teleport=v1.alpha_teleport, projected_sc_pt=np.asarray(v1.single_cell_pt_markov)[idx], true_label=np.asarray(true_label)[idx], knn=v0.knn,
final_super_terminal=v1.revised_super_terminal_clusters,
sub_terminal_clusters=v1.terminal_clusters,
title_str='Hitting times: Markov Simulation on biased edges', ncomp=ncomps)
# DRAW EVOLUTION PATHS
knn_hnsw = via.make_knn_embeddedspace(embedding[idx])
via.draw_sc_evolution_trajectory_dijkstra(v1, embedding[idx], knn_hnsw, v0.full_graph_shortpath, idx)
plt.show()
gene_name_dict = {'GATA1': 'GATA1', 'GATA2': 'GATA2', 'ITGA2B': 'CD41 (Mega)', 'MPO': 'MPO (Mono)',
'CD79B': 'CD79B (B)', 'IRF8': 'IRF8 (DC)', 'SPI1': 'PU.1', 'CD34': 'CD34',
'CSF1R': 'CSF1R (cDC Up. Up then Down in pDC)', 'IL3RA': 'CD123 (pDC)', 'IRF4': 'IRF4 (pDC)',
'ITGAX': 'ITGAX (cDCs)', 'CSF2RA': 'CSF2RA (cDC)'}
for gene_name in ['ITGA2B', 'IL3RA', 'IRF8', 'MPO', 'CSF1R', 'GATA2', 'CD79B',
'CD34']: # ['GATA1', 'GATA2', 'ITGA2B', 'MPO', 'CD79B','IRF8','SPI1', 'CD34','CSF1R','IL3RA','IRF4', 'CSF2RA','ITGAX']:
print('gene name', gene_name)
# DC markers https://www.cell.com/pb-assets/products/nucleus/nucleus-phagocytes/rnd-systems-dendritic-cells-br.pdf
loc_gata = np.where(np.asarray(ad.var_names) == gene_name)[0][0]
subset_ = df_magic[gene_name].values
# print('shapes of magic_ad 1 and 2', magic_ad.shape,subset_.shape)
# v1.get_gene_expression(magic_ad,title_gene = gene_name_dict[gene_name])
v1.get_gene_expression(subset_, title_gene=gene_name_dict[gene_name])
主要的資料都在這里VIA.
生活很好芦疏,有你更好