什么是RNA 速率(RNA velocity)
RNA速率能夠通過疊加剪接信息來推斷細(xì)胞分化的方向性蒙揣。
隨著令人震驚的發(fā)現(xiàn)怯伊,未剪接和剪接的mRNA豐度可以在標(biāo)準(zhǔn)的單細(xì)protocols胞中進(jìn)行區(qū)分岸霹,La Manno等人(2018)引入了RNA速度(RNA velocity)的概念莉兰。通過將測量結(jié)果與潛在的mRNA剪接動力學(xué)聯(lián)系起來贞奋,探索定向軌跡的推斷:特定基因的轉(zhuǎn)錄誘導(dǎo)導(dǎo)致(新轉(zhuǎn)錄的)前體未剪接mRNA的增加窒篱,而相反恬涧,轉(zhuǎn)錄的抑制或缺失導(dǎo)致未剪接mRNA的減少注益。因此,通過將未剪接的mRNA與成熟的剪接mRNA進(jìn)行區(qū)分溯捆,可以近似地得到mRNA豐度的變化丑搔,即其時間導(dǎo)數(shù),即RNA速度提揍∑≡拢跨mrna的速度組合可以用來估計單個細(xì)胞的未來狀態(tài)。細(xì)胞的運動是通過將速度投影到低維嵌入中來實現(xiàn)的劳跃。
scVelo如何估計RNA速度
RNA速度估計目前有三種方法:
- 穩(wěn)態(tài)/確定性模型(在velocyto中使用)
- 隨機(jī)模型(使用二階矩)
- 動態(tài)模型(使用基于概率的框架)
velocyto中使用的穩(wěn)態(tài)/確定性模型對速度的估計如下:在假定轉(zhuǎn)錄階段(誘導(dǎo)和抑制)持續(xù)足夠長的時間以達(dá)到穩(wěn)態(tài)平衡(活躍和不活躍)的情況下谎仲,速度被量化為實際觀測值如何偏離穩(wěn)態(tài)平衡。平衡mRNA水平近似于在假定的上下分位數(shù)的穩(wěn)定狀態(tài)下的線性回歸刨仑。這種簡化是通過假設(shè)一個跨基因的通用剪接率和數(shù)據(jù)中反映的穩(wěn)態(tài)mRNA水平來實現(xiàn)的郑诺。這可能導(dǎo)致速度估計和細(xì)胞狀態(tài)的錯誤夹姥,因為這些假設(shè)經(jīng)常被違背,特別是當(dāng)一個種群包含多個異質(zhì)亞種群動態(tài)時辙诞。
隨機(jī)模型的目標(biāo)是更好地捕捉穩(wěn)態(tài)辙售,但與穩(wěn)態(tài)模型的假設(shè)相同。它是通過處理轉(zhuǎn)錄飞涂,剪接和降解作為概率事件旦部,從而納入二階矩。也就是說较店,穩(wěn)態(tài)水平不僅與mRNA水平近似士八,而且與內(nèi)在表達(dá)變異性近似。
動態(tài)模型(最強(qiáng)大的泽西,同時計算最耗資源)解決了每個基因的剪接動力學(xué)的全部動態(tài)曹铃。因此,它使RNA速度適應(yīng)廣泛變化的規(guī)格捧杉,如非平穩(wěn)群體陕见,因為它不依賴于限制一個共同的剪接率或穩(wěn)態(tài)被采樣。通過迭代估計反應(yīng)速率和潛在細(xì)胞特異性變量的可識別參數(shù)味抖,即轉(zhuǎn)錄狀態(tài)和細(xì)胞內(nèi)潛伏時間评甜,在基于概率的期望最大化框架中求解剪接動力學(xué)。更準(zhǔn)確地說仔涩,我們明確地模擬了四種轉(zhuǎn)錄狀態(tài)忍坷,以考慮基因活動的所有可能配置:兩個動態(tài)瞬態(tài)(誘導(dǎo)和抑制)和兩個穩(wěn)態(tài)(活躍和不活躍)在每個動態(tài)轉(zhuǎn)變后可能達(dá)到的狀態(tài)。對于每個狀態(tài)下的每個觀測值熔脂,計算一個模型的最優(yōu)潛伏期佩研,以獲得一個映射到非剪接/剪接動態(tài)的學(xué)習(xí)曲線上。然后霞揉,細(xì)胞到曲線的映射產(chǎn)生各自狀態(tài)的可能性旬薯,并通過最大可能的整體反應(yīng)速率參數(shù)。
這產(chǎn)生了更一致的速度估計和更好的識別轉(zhuǎn)錄狀態(tài)适秩。該模型進(jìn)一步能夠以一種基于概率的方式系統(tǒng)地識別動態(tài)驅(qū)動基因绊序,從而找到控制細(xì)胞命運轉(zhuǎn)變的關(guān)鍵驅(qū)動因素。此外秽荞,動態(tài)模型推斷了一個普遍的細(xì)胞內(nèi)潛伏時間共享的基因骤公,使相關(guān)基因和識別轉(zhuǎn)錄變化的機(jī)制(如分支點)。
為了得到最好的結(jié)果扬跋,我們顯然推薦使用更優(yōu)的動態(tài)模型阶捆。如果運行時很重要,我們建議使用隨機(jī)模型,它是用來近似動態(tài)模型的趁猴。隨機(jī)模型在30k細(xì)胞上用時不到一分鐘刊咳。然而,動力仍然可以花一個小時儡司,然而,提高效率正在進(jìn)行中余指。
安裝
#conda install -c conda-forge numba pytables louvain
pip install -U scvelo
## or # git clone git+https://github.com/theislab/scvelo
## pip install -e scvelo
比對
為了獲得MRNA剪切與為剪切的信息捕犬,我們需要比對后的bam文件,比對可以采用不同的方法:
scVelo in action
如果我們已經(jīng)熟悉scanpy的基本操作酵镜,理解scVelo 就不是難事了碉碉,因為二者的數(shù)據(jù)結(jié)構(gòu)是一樣的。
# -*- coding: utf-8 -*-
"""
Created on Fri Mar 27 08:34:49 2020
https://scvelo.readthedocs.io/
@author: Administrator
"""
import scvelo as scv
import scanpy as sc
scv.logging.print_version()
scv.settings.verbosity = 3 # show errors(0), warnings(1), info(2), hints(3)
scv.settings.set_figure_params('scvelo') # for beautified visualization
scv.settings.set_figure_params('scvelo')
# filename = 'E:\learnscanpy\write\pbmc3k.h5ad'
# adata = scv.read(filename, cache=False)
loomf = 'E:\learnscanpy\data\possorted_genome_bam_AYDSF.loom'
adata = scv.read(loomf, cache=False)
adata.var_names_make_unique
這里的possorted_genome_bam_AYDSF.loom是我們10X的數(shù)據(jù)比對后用velocyto處理得到的loom文件淮韭。
我們來看看數(shù)據(jù)結(jié)構(gòu):
adata
Out[6]:
AnnData object with n_obs × n_vars = 7292 × 33694
var: 'Accession', 'Chromosome', 'End', 'Start', 'Strand'
layers: 'matrix', 'ambiguous', 'spliced', 'unspliced'
adata.var
Out[7]:
Accession Chromosome End Start Strand
FAM138A ENSG00000237613 1 36081 34554 -
RP11-34P13.7 ENSG00000238009 1 133723 89295 -
RP11-34P13.8 ENSG00000239945 1 91105 89551 -
RP11-34P13.14 ENSG00000239906 1 140339 139790 -
FO538757.2 ENSG00000279457 1 200322 184923 -
... ... ... ... ...
AC006386.1 ENSG00000279115 Y 25308107 25307702 +
AC006328.4 ENSG00000280301 Y 25473714 25463994 +
CSPG4P1Y ENSG00000240450 Y 25486705 25482908 +
CDY1 ENSG00000172288 Y 25624902 25622162 +
TTTY3 ENSG00000231141 Y 25733388 25728490 +
[33694 rows x 5 columns]
adata.layers['matrix']
Out[8]:
<7292x33694 sparse matrix of type '<class 'numpy.float32'>'
with 17201128 stored elements in Compressed Sparse Row format>
數(shù)據(jù)預(yù)處理
檢測基因選擇(用最少的計數(shù)檢測)和高變異性(離散度)垢粮。-根據(jù)每個細(xì)胞的初始大小對其進(jìn)行歸一化,并對X取對數(shù)靠粪。
scv.pp.filter_genes(adata, min_shared_counts=10)
scv.pp.normalize_per_cell(adata)
scv.pp.filter_genes_dispersion(adata, n_top_genes=3000)
scv.pp.log1p(adata)
此外蜡吧,我們需要在PCA空間中計算最近鄰的一階和二階矩(基本上是均值和無中心方差)。確定性速度估計需要一階矩占键,而隨機(jī)估計也需要二階矩昔善。
scv.pp.filter_and_normalize(adata, min_shared_counts=30, n_top_genes=2000)
scv.pp.moments(adata, n_pcs=30, n_neighbors=30)
adata
Out[15]:
AnnData object with n_obs × n_vars = 7292 × 1999
obs: 'initial_size_unspliced', 'initial_size_spliced', 'initial_size', 'n_counts'
var: 'Accession', 'Chromosome', 'End', 'Start', 'Strand', 'means', 'dispersions', 'dispersions_norm'
uns: 'pca', 'neighbors', 'connectivities_key', 'distances_key'
obsm: 'X_pca'
varm: 'PCs'
layers: 'matrix', 'ambiguous', 'spliced', 'unspliced', 'Ms', 'Mu'
obsp: 'distances', 'connectivities'
計算velocity 和velocity 圖
通過擬合前體(未剪接的)和成熟(剪接的)mRNA豐度之間的比值來獲得基因特異性速度,該比值很好地解釋了穩(wěn)定狀態(tài)(恒定轉(zhuǎn)錄狀態(tài))畔乙,然后計算觀察到的豐度如何偏離穩(wěn)定狀態(tài)下的期望君仆。
scv.tl.velocity(adata)
computing velocities
finished (0:00:01) --> added
'velocity', velocity vectors for each individual cell (adata.layers)
這計算了在高維空間中勢元躍遷與速度矢量的(余弦)相關(guān)性。結(jié)果的velocity 圖nobs×nobs牲距,并總結(jié)了可能的細(xì)胞狀態(tài)變化(從一個細(xì)胞到另一個細(xì)胞的過渡)返咱,這些變化通過速度向量得到了很好的解釋。如果你設(shè)置approx=True牍鞠,它是在一個減少的PCA空間上計算的咖摹,有50個PC。
然后皮服,通過在余弦相關(guān)上應(yīng)用高斯核楞艾,將速度圖轉(zhuǎn)換為一個過渡矩陣,該高斯核賦予與速度向量相關(guān)的細(xì)胞狀態(tài)變化的高概率龄广×蛎校可以通過scv.tl.transition_matrix訪問馬爾可夫轉(zhuǎn)移矩陣。由此產(chǎn)生的轉(zhuǎn)換矩陣可用于以下所示的各種應(yīng)用程序择同。例如两入,它通過簡單地應(yīng)用相對于轉(zhuǎn)移概率的平均轉(zhuǎn)移,即scv.tl. velocity_embedded敲才,來將速度放入低維嵌入中裹纳。此外择葡,通過scv.tl.terminal_states,我們可以沿著馬爾科夫鏈追溯細(xì)胞的起源和潛在的命運,從而獲得軌跡中的根細(xì)胞和終點剃氧。
scv.tl.velocity_graph(adata)
computing velocity graph
100% finished (0:00:04) --> added
'velocity_graph', sparse matrix with cosine correlations (adata.uns)
Out[18]:
AnnData object with n_obs × n_vars = 7292 × 1999
obs: 'initial_size_unspliced', 'initial_size_spliced', 'initial_size', 'n_counts', 'velocity_self_transition'
var: 'Accession', 'Chromosome', 'End', 'Start', 'Strand', 'means', 'dispersions', 'dispersions_norm', 'velocity_gamma', 'velocity_r2', 'velocity_genes'
uns: 'pca', 'neighbors', 'connectivities_key', 'distances_key', 'velocity_settings', 'velocity_graph', 'velocity_graph_neg'
obsm: 'X_pca'
varm: 'PCs'
layers: 'matrix', 'ambiguous', 'spliced', 'unspliced', 'Ms', 'Mu', 'velocity', 'variance_velocity'
obsp: 'distances', 'connectivities'
可視化結(jié)果
在可視化之前我們先計算一下umap:
scv.pp.neighbors(adata, n_neighbors=10, n_pcs=40)
sc.tl.leiden(adata,adjacency = adata.obsp['connectivities'])
# 如果是pip安裝的敏储,adjacency = adata.obsp['connectivities']就不要了。
sc.tl.leiden(adata)
vdata = adata
#adata.uns['neighbors']['indices']
#adata.uns['connectivities_key']
#adata.obsp['connectivities'].tocoo()
scv.tl.umap(adata)
scv.pl.velocity_embedding_stream(adata, basis='umap', color=['leiden','initial_size_spliced'])
computing velocity embedding
finished (0:00:01) --> added
'velocity_umap', embedded velocity vectors (adata.obsm)
基因排序
help(scv.tl.rank_velocity_genes) # 改造函數(shù)
Help on function rank_velocity_genes in module scvelo.tools.rank_velocity_genes:
rank_velocity_genes(data, vkey='velocity', n_genes=10, groupby=None, match_with=None, resolution=None, min_counts=None, min_r2=None, min_dispersion=None, min_likelihood=None, copy=False)
Rank genes for velocity characterizing groups.
This applies a differential expression test (Welch t-test with overestimated variance to be conservative) on
velocity expression, to find genes in a cluster that show dynamics that is transcriptionally regulated differently
compared to all other clusters (e.g. induction in that cluster and homeostasis in remaining population).
If no clusters are given, it priorly computes velocity clusters by applying louvain modularity on velocity expression.
.. code:: python
scv.tl.rank_velocity_genes(adata, groupby='clusters')
scv.pl.scatter(adata, basis=adata.uns['rank_velocity_genes']['names']['Beta'][:3])
pd.DataFrame(adata.uns['rank_velocity_genes']['names']).head()
.. image:: https://user-images.githubusercontent.com/31883718/69626017-11c47980-1048-11ea-89f4-df3769df5ad5.png
:width: 600px
.. image:: https://user-images.githubusercontent.com/31883718/69626572-30774000-1049-11ea-871f-e8a30c42f10e.png
:width: 600px
Arguments
----------
data : :class:`~anndata.AnnData`
Annotated data matrix.
vkey: `str` (default: `'velocity'`)
Key of velocities computed in `tl.velocity`
n_genes : `int`, optional (default: 100)
The number of genes that appear in the returned tables.
groupby: `str`, `list` or `np.ndarray` (default: `None`)
Key of observations grouping to consider.
match_with: `str` or `None` (default: `None`)
adata.obs key to separatively rank velocities on.
resolution: `str` or `None` (default: `None`)
Resolution for louvain modularity.
min_counts: `float` (default: None)
Minimum count of genes for consideration.
min_r2: `float` (default: None)
Minimum r2 value of genes for consideration.
min_dispersion: `float` (default: None)
Minimum dispersion norm value of genes for consideration.
min_likelihood: `float` between `0` and `1` or `None` (default: `None`)
Only rank velocity of genes with a likelihood higher than min_likelihood.
copy: `bool` (default: `False`)
Return a copy instead of writing to data.
Returns
-------
Returns or updates `data` with the attributes
rank_velocity_genes : `.uns`
Structured array to be indexed by group id storing the gene
names. Ordered according to scores.
velocity_score : `.var`
Storing the score for each gene for each group. Ordered according to scores.
scv.tl.rank_velocity_genes(adata, match_with='leiden', resolution=.4)
import pandas as pd
pd.DataFrame(adata.uns['rank_velocity_genes']['names']).head()
Out[27]:
0 0 1 1 4 2 7 3 ... 0 9 0 10 10 11 5 12
0 GNLY NKG7 HCK CCL5 ... CD36 HLA-DQB1 PTGS1 LST1
1 MNDA RGS18 SMCO4 CD160 ... DMXL2 HLA-DQA1 CD22 CD22
2 CLEC10A METTL17 DMXL2 SH2D1B ... CD74 CCDC50 RGS18 CCDC50
3 HLA-DPA1 HLA-DPA1 CYP27A1 RGS18 ... CLEC10A TSPAN13 MMD HLA-DQB1
4 HLA-DRA DDX11 FGR HLA-DPA1 ... MMD CD22 FGR HLA-DQA1
[5 rows x 13 columns]
scv.pl.velocity_embedding_grid(adata, color='GNLY',
layer=['velocity', 'spliced'], arrow_size=1.5)
特定基因的剪切狀態(tài)朋鞍,velocity動態(tài)以及表達(dá)情況已添,直接整理成可以發(fā)表的形式:
var_names = ['GNLY', 'RGS18', 'DMXL2', 'SH2D1B']
scv.pl.velocity(adata, var_names=var_names, colorbar=True, ncols=2)
help(scv.pl.velocity_graph)
scv.pl.velocity_graph(adata,color='GNLY')
動力學(xué)分析
help(scv.tl.recover_dynamics)
Help on function recover_dynamics in module scvelo.tools.dynamical_model:
recover_dynamics(data, var_names='velocity_genes', n_top_genes=None, max_iter=10, assignment_mode='projection', t_max=None, fit_time=True, fit_scaling=True, fit_steady_states=True, fit_connected_states=None, fit_basal_transcription=None, use_raw=False, load_pars=None, return_model=None, plot_results=False, steady_state_prior=None, add_key='fit', copy=False, **kwargs)
Recovers the full splicing kinetics of specified genes.
The model infers transcription rates, splicing rates, degradation rates,
as well as cell-specific latent time and transcriptional states, estimated iteratively by expectation-maximization.
.. image:: https://user-images.githubusercontent.com/31883718/69636459-ef862800-1056-11ea-8803-0a787ede5ce9.png
Arguments
---------
data: :class:`~anndata.AnnData`
Annotated data matrix.
var_names: `str`, list of `str` (default: `'velocity_genes`)
Names of variables/genes to use for the fitting.
n_top_genes: `int` or `None` (default: `None`)
Number of top velocity genes to use for the dynamical model.
max_iter:`int` (default: `10`)
Maximal iterations in the EM-Algorithm.
assignment_mode: `str` (default: `projection`)
Determined how times are assigned to observations.
If `projection`, observations are projected onto the model trajectory.
Else uses an inverse approximating formula.
t_max: `float`, `False` or `None` (default: `None`)
Total range for time assignments.
fit_scaling: `bool` or `float` or `None` (default: `True`)
Whether to fit scaling between unspliced and spliced or keep initially given scaling fixed.
fit_time: `bool` or `float` or `None` (default: `True`)
Whether to fit time or keep initially given time fixed.
fit_steady_states: `bool` or `None` (default: `True`)
Allows fitting of observations to steady states next to repression and induction.
fit_connected_states: `bool` or `None` (default: `None`)
Restricts fitting to neighbors given by connectivities.
fit_basal_transcription: `bool` or `None` (default: `None`)
Enables model to incorporate basal transcriptions.
use_raw: `bool` or `None` (default: `None`)
if True, use .layers['sliced'], else use moments from .layers['Ms']
load_pars: `bool` or `None` (default: `None`)
Load parameters from past fits.
return_model: `bool` or `None` (default: `True`)
Whether to return the model as :DynamicsRecovery: object.
plot_results: `bool` or `None` (default: `False`)
Plot results after parameter inference.
steady_state_prior: list of `bool` or `None` (default: `None`)
Mask for indices used for steady state regression.
add_key: `str` (default: `'fit'`)
Key to add to parameter names, e.g. 'fit_t' for fitted time.
copy: `bool` (default: `False`)
Return a copy instead of writing to `adata`.
Returns
-------
Returns or updates `adata`
scv.tl.recover_dynamics(adata)
scv.tl.velocity(adata, mode='dynamical')
scv.tl.velocity_graph(adata)
adata
Out[37]:
AnnData object with n_obs × n_vars = 7292 × 1999
obs: 'initial_size_unspliced', 'initial_size_spliced', 'initial_size', 'n_counts', 'velocity_self_transition', 'leiden', 'velocity_clusters'
var: 'Accession', 'Chromosome', 'End', 'Start', 'Strand', 'means', 'dispersions', 'dispersions_norm', 'velocity_gamma', 'velocity_r2', 'velocity_genes', 'velocity_score', 'fit_alpha', 'fit_beta', 'fit_gamma', 'fit_t_', 'fit_scaling', 'fit_std_u', 'fit_std_s', 'fit_likelihood', 'fit_u0', 'fit_s0', 'fit_pval_steady', 'fit_steady_u', 'fit_steady_s', 'fit_alignment_scaling', 'fit_r2'
uns: 'pca', 'neighbors', 'connectivities_key', 'distances_key', 'velocity_settings', 'velocity_graph', 'velocity_graph_neg', 'leiden', 'umap', 'leiden_colors', 'rank_velocity_genes', 'recover_dynamics'
obsm: 'X_pca', 'X_umap', 'velocity_umap'
varm: 'PCs', 'loss'
layers: 'matrix', 'ambiguous', 'spliced', 'unspliced', 'Ms', 'Mu', 'velocity', 'variance_velocity', 'fit_t', 'fit_tau', 'fit_tau_', 'velocity_u'
obsp: 'distances', 'connectivities'
scv.pl.scatter(adata, color='latent_time', fontsize=24, size=100,
color_map='gnuplot', perc=[2, 98], colorbar=True, rescale_color=[0,1])
top_genes = adata.var_names[adata.var.fit_likelihood.argsort()[::-1]][:300]
scv.pl.heatmap(adata, var_names=top_genes, tkey='latent_time', n_convolve=100, col_color='leiden')
top_genes
Out[43]:
Index(['TBXAS1', 'NKG7', 'PRAM1', 'HCK', 'SMCO4', 'CD22', 'SHTN1', 'CD74',
'HLA-DQA1', 'HLA-DRA',
...
'GSPT1', 'GALNS', 'NDUFAB1', 'NDUFV2-AS1', 'DHX40', 'PPM1D', 'BCAS3',
'METTL2A', 'TANC2', 'DDX42'],
dtype='object', length=300)
scv.pl.scatter(adata, basis=top_genes[:10], legend_loc='none',
size=80, frameon=False, ncols=5, fontsize=20,color='leiden')
scv.pl.scatter(adata, x='latent_time', y=top_genes[:4], fontsize=16, size=100,
n_convolve=None, frameon=False, legend_loc='none',color='leiden')