RNA速率分析是一個(gè)非常重要的分析內(nèi)容,我們這里來深入解析一下這個(gè)分析內(nèi)容(python版本缓溅,個(gè)人喜歡python),參考網(wǎng)址在Velocyto
安裝velocyto其實(shí)很簡單,直接pip安裝
pip install velocyto
至于用法:
velocyto --help
Usage: velocyto [OPTIONS] COMMAND [ARGS]...
Options:
--version Show the version and exit.
--help Show this message and exit.
Commands:
run Runs the velocity analysis outputting a loom file
run10x Runs the velocity analysis for a Chromium Sample
run-dropest Runs the velocity analysis on DropEst preprocessed data
run-smartseq2 Runs the velocity analysis on SmartSeq2 data (independent bam file per cell)
tools helper tools for velocyto
其實(shí)做velocyto分析使用的是轉(zhuǎn)錄組的可變剪切瓷们,那么對(duì)于轉(zhuǎn)錄組的要求當(dāng)然是越長越好,最好是全長秒咐,也就是說用smartseq2的數(shù)據(jù)做這個(gè)分析最為合適谬晕,但是10X數(shù)據(jù)最多最火,也可以嘗試携取。
velocyto run10x --help
Usage: velocyto run10x [OPTIONS] SAMPLEFOLDER GTFFILE
Runs the velocity analysis for a Chromium 10X Sample
10XSAMPLEFOLDER specifies the cellranger sample folder
GTFFILE genome annotation file
Options:
-s, --metadatatable FILE Table containing metadata of the various samples (csv fortmated rows are samples and cols are entries)
-m, --mask FILE .gtf file containing intervals to mask
-l, --logic TEXT The logic to use for the filtering (default: Default)
-M, --multimap Consider not unique mappings (not reccomended)
-@, --samtools-threads INTEGER The number of threads to use to sort the bam by cellID file using samtools
--samtools-memory INTEGER The number of MB used for every thread by samtools to sort the bam file
-t, --dtype TEXT The dtype of the loom file layers - if more than 6000 molecules/reads per gene per cell are expected set uint32 to avoid truncation (default run_10x: uint16)
-d, --dump TEXT For debugging purposes only: it will dump a molecular mapping report to hdf5. --dump N, saves a cell every N cells. If p is prepended a more complete (but huge) pickle report is printed (default: 0)
-v, --verbose Set the vebosity level: -v (only warinings) -vv (warinings and info) -vvv (warinings, info and debug)
--help Show this message and exit.
簡單的用法
velocyto run10x -m repeat_msk.gtf mypath/sample01 somepath/refdata-cellranger-mm10-1.2.0/genes/genes.gtf
產(chǎn)生的loom文件將用于下游的分析攒钳。
我們這里詳細(xì)看看這個(gè)文件及分析是怎樣的。(這里還是python為例雷滋,R也可以R讀loom需要hdf5r不撑,非常難裝,裝成功了一定要記錄晤斩,這也是我討厭R的原因之一)
首先第一步焕檬,我們來看看生成的loom文件里面都有什么
import scvelo as scv
scv.set_figure_params()##設(shè)置可視化的一些參數(shù)
data = scv.read('TH_possorted_genome_bam_YHVE5.loom', cache=True) ###讀取loom文件
當(dāng)然我們這里也可以讀取h5ad格式的文件(由python分析軟件scanpy生成)。
data = scv.read(filename, cache=True)
我們來看看loom文件里面有什么:
data
AnnData object with n_obs × n_vars = 70668 × 27998
var: 'Accession', 'Chromosome', 'End', 'Start', 'Strand'
layers: 'matrix', 'ambiguous', 'spliced', 'unspliced'
data.var
Accession Chromosome End Start Strand
Xkr4 ENSMUSG00000051951 1 3671498 3205901 -
Gm37381 ENSMUSG00000102343 1 3986215 3905739 -
Rp1 ENSMUSG00000025900 1 4409241 3999557 -
Rp1 ENSMUSG00000109048 1 4409187 4292981 -
Sox17 ENSMUSG00000025902 1 4497354 4490931 -
... ... ... ... ... ...
Gm28672 ENSMUSG00000100492 Y 89225296 89222067 +
Gm28670 ENSMUSG00000099982 Y 89394761 89391528 +
Gm29504 ENSMUSG00000100533 Y 90277501 90275224 +
Gm20837 ENSMUSG00000096178 Y 90433263 90401248 +
Erdr1 ENSMUSG00000096768 Y 90816464 90784738 +
###var里面是一些基因的信息
ldata.layers
Layers with keys: matrix, ambiguous, spliced, unspliced
###layer下面的東西有點(diǎn)多澳泵,包括矩陣信息实愚,以及基因是否是可變剪切等信息矩陣。
可以看出這個(gè)地方只有關(guān)于可變剪切的信息兔辅,沒有我們單細(xì)胞聚類得到的信息腊敲,所有我們需要將聚類結(jié)果與該loom文件的內(nèi)容進(jìn)行整合(merge),注意這里是python维苔,需要讀取scanpy的分析結(jié)果h5ad文件.但這里我們沒有h5ad文件碰辅,只好讀取Seurat的單獨(dú)結(jié)果進(jìn)行賦值,這里我們只需要聚類信息和二維坐標(biāo)信息
import pandas as pd
import numpy as np
cluster = pd.read_csv(file)
adata.obs['clusters'] = cluster['Cluster']
tsne_axis=open(file,'r')
for i in all_tsne[1:]:
tsne_axis_dict={}
all_tsne=tsne_axis.readlines()
... key=i.strip().split(',')[0]
... tsne_axis_dict[key]=[]
... tsne_axis_dict[key].append(i.strip().split(',')[1])
... tsne_axis_dict[key].append(i.strip().split(',')[2])
... tsne_axis_dict[key]=np.array(tsne_axis_dict[key],dtype='float32')
tsne_axis_list=[]
for i in Barcode:
tsne_axis_list.append(tsne_axis_dict[i])
adata.obsm['X_tsne']=np.array(tsne_axis_list)
###UMAP的賦值也是一樣的蕉鸳,這樣我們就得到了可以直接分析的無縫銜接的文件乎赴。
接下來就是可視化了,就相對(duì)簡單了
scv.pl.proportions(adata)
這個(gè)圖展示的是每個(gè)cluster可變切剪的比例值潮尝,這個(gè)地方我們需要注意榕吼,跟聚類結(jié)果嚴(yán)格相關(guān),也就是說我們必須確保聚類結(jié)果相對(duì)完美的情況下做這個(gè)分析勉失。
接下來我們要進(jìn)行RNA速率分析羹蚣,這里我們要注意,我們嫁接了Seurat 的結(jié)果乱凿,不要再進(jìn)行過濾等操作
scv.tl.velocity(adata)
###The computed velocities are stored in adata.layers just like the count matrices.
scv.tl.velocity_graph(adata)
###For a variety of applications, the velocity graph can be converted to a transition matrix by applying a Gaussian kernel to transform the cosine correlations into actual transition probabilities. You can access the Markov transition matrix via scv.utils.get_transition_matrix.
接下來就是一些展示了顽素,很簡單
scv.pl.velocity_embedding_stream(adata, basis='umap')
scv.pl.velocity_embedding(adata, arrow_length=3, arrow_size=2, dpi=120)
marker gene的展示
scv.pl.velocity(adata, ['Cpe', 'Gnao1', 'Ins2', 'Adk'], ncols=2)
Identify important genes
We need a systematic way to identify genes that may help explain the resulting vector field and inferred lineages. To do so, we can test which genes have cluster-specific differential velocity expression, being siginificantly higher/lower compared to the remaining population. The module scv.tl.rank_velocity_genes
runs a differential velocity t-test and outpus a gene ranking for each cluster. Thresholds can be set (e.g. min_corr
) to restrict the test on a selection of gene candidates.
scv.tl.rank_velocity_genes(adata, groupby='clusters', min_corr=.3)
df = scv.DataFrame(adata.uns['rank_velocity_genes']['names'])
df.head()
kwargs = dict(frameon=False, size=10, linewidth=1.5,
add_outline='Ngn3 high EP, Pre-endocrine, Beta')
scv.pl.scatter(adata, df['Ngn3 high EP'][:5], ylabel='Ngn3 high EP', **kwargs)
scv.pl.scatter(adata, df['Pre-endocrine'][:5], ylabel='Pre-endocrine', **kwargs)
接下來的可視化分析都很簡單咽弦,我就不多說了,大家趕緊去試試吧
請(qǐng)保持憤怒胁出,讓王多魚傾家蕩產(chǎn)