R在讀取和處理數(shù)據(jù)的過程中會(huì)將所有的變量和占用都儲(chǔ)存在RAM當(dāng)中摹蘑,這樣一來(lái)惊畏,對(duì)于海量的單細(xì)胞RNA-seq數(shù)據(jù)(尤其是超過250k的細(xì)胞量)舀透,即使在服務(wù)器當(dāng)中運(yùn)行栓票,Seurat、metacell愕够、monocle這一類的R包的使用還是會(huì)產(chǎn)生內(nèi)存不足的問題走贪。
但是最近我發(fā)現(xiàn)了一個(gè)基于python的單細(xì)胞基因表達(dá)分析包scanpy,能夠很好地在我這個(gè)僅4G內(nèi)存的小破機(jī)上分析378k的細(xì)胞惑芭,并且功能豐富程度不亞于Seurat坠狡。它包含了數(shù)據(jù)預(yù)處理、可視化遂跟、聚類擦秽、偽時(shí)間分析和軌跡推斷、差異表達(dá)分析漩勤。根據(jù)官網(wǎng)描述感挥,scanpy能夠有效處理超過1,000,000個(gè)細(xì)胞的數(shù)據(jù)集。
Scanpy – Single-Cell Analysis in Python:https://scanpy.readthedocs.io/en/latest/
安裝與數(shù)據(jù)下載
安裝好python3之后越败,終端運(yùn)行:
pip install scanpy
若安裝過程出現(xiàn)問題触幼,請(qǐng)參考:
https://scanpy.readthedocs.io/en/latest/installation.html
骨髓單細(xì)胞轉(zhuǎn)錄組測(cè)序數(shù)據(jù)下載地址:
https://preview.data.humancellatlas.org
Step0, 讀取數(shù)據(jù)
運(yùn)行python
import numpy as np
import pandas as pd
import scanpy as sc
# 可以直接讀取10Xgenomics的.h5格式數(shù)據(jù)
adata = sc.read_10x_h5("/Users/shinianyike/Desktop/ica_bone_marrow_h5.h5"
, genome=None, gex_only=True)
adata.var_names_make_unique()
查看數(shù)據(jù):
adata
AnnData object with n_obs × n_vars = 378000 × 33694
var: 'gene_ids'
共378000個(gè)細(xì)胞,33694個(gè)基因究飞。
Step1, 數(shù)據(jù)預(yù)處理
這一步目的將數(shù)據(jù)進(jìn)行篩選和過濾置谦,一些測(cè)序質(zhì)量差的數(shù)據(jù)會(huì)讓后續(xù)的分析產(chǎn)生噪音和干擾堂鲤,因此我們需要將它們?nèi)コ?/p>
展示在所有的細(xì)胞當(dāng)中表達(dá)占比最高的20個(gè)基因。
sc.pl.highest_expr_genes(adata, n_top=20)
基礎(chǔ)過濾:
去除表達(dá)基因200以下的細(xì)胞媒峡;
去除在3個(gè)細(xì)胞以下表達(dá)的基因瘟栖。
sc.pp.filter_cells(adata, min_genes=200) # 去除表達(dá)基因200以下的細(xì)胞
sc.pp.filter_genes(adata, min_cells=3) # 去除在3個(gè)細(xì)胞以下表達(dá)的基因
adata
AnnData object with n_obs × n_vars = 335618 × 24888
obs: 'n_genes'
var: 'gene_ids', 'n_cells'
共335618個(gè)細(xì)胞,24888個(gè)基因谅阿。(可以發(fā)現(xiàn)細(xì)胞跟基因數(shù)量都減少了)
質(zhì)量控制:
在測(cè)序后半哟,有很大比例是低質(zhì)量的細(xì)胞,可能是因?yàn)榧?xì)胞破碎造成的胞質(zhì)RNA流失签餐,由于線粒體比單個(gè)的轉(zhuǎn)錄分子要大得多寓涨,不容易在破碎的細(xì)胞膜中泄漏出去,這樣一來(lái)就導(dǎo)致測(cè)序結(jié)果顯示線粒體基因的比例在細(xì)胞內(nèi)占比異常高氯檐。這一步質(zhì)量控制的目的就是將這些低質(zhì)量的細(xì)胞去除掉戒良。
計(jì)算線粒體基因占所有基因的比例:
mito_genes = adata.var_names.str.startswith('MT-')
# for each cell compute fraction of counts in mito genes vs. all genes
# the `.A1` is only necessary as X is sparse (to transform to a dense array after summing)
adata.obs['percent_mito'] = np.sum(
adata[:, mito_genes].X, axis=1).A1 / np.sum(adata.X, axis=1).A1
# add the total counts per cell as observations-annotation to adata
adata.obs['n_counts'] = adata.X.sum(axis=1).A1
作小提琴圖,查看線粒體基因占比分布:
sc.pl.violin(adata, ['n_genes', 'n_counts', 'percent_mito'],
jitter=0.4, multi_panel=True)
由于數(shù)據(jù)點(diǎn)實(shí)在太多冠摄,小提琴已被數(shù)據(jù)點(diǎn)覆蓋糯崎,無(wú)法顯示出來(lái)。
這里還可以做一個(gè)散點(diǎn)圖河泳,也可以直觀地顯示出一些異常分布的數(shù)據(jù)點(diǎn)沃呢。
sc.pl.scatter(adata, x='n_counts', y='percent_mito')
sc.pl.scatter(adata, x='n_counts', y='n_genes')
adata
AnnData object with n_obs × n_vars = 335618 × 24888
obs: 'n_genes', 'percent_mito', 'n_counts'
var: 'gene_ids', 'n_cells'
335618個(gè)細(xì)胞,24888個(gè)基因乔询;
下面這一步進(jìn)行真正的過濾樟插,根據(jù)上面做的圖,去除基因數(shù)目過多(大于等于4000)和線粒體基因占比過多(大于等于0.3)的細(xì)胞:
adata = adata[adata.obs['n_genes'] < 4000, :]
adata = adata[adata.obs['percent_mito'] < 0.3, :]
過濾后查看剩下多少細(xì)胞和基因竿刁。
adata
View of AnnData object with n_obs × n_vars = 328435 × 24888
obs: 'n_genes', 'percent_mito', 'n_counts'
var: 'gene_ids', 'n_cells'
328435個(gè)細(xì)胞黄锤,24888個(gè)基因。
數(shù)據(jù)標(biāo)準(zhǔn)化
在繪圖之前食拜,還要進(jìn)行一步數(shù)據(jù)標(biāo)準(zhǔn)化鸵熟,將表達(dá)量用對(duì)數(shù)計(jì)算一遍,這樣有利于繪圖和展示负甸。
sc.pp.normalize_per_cell(adata, counts_per_cell_after=1e4)
sc.pp.log1p(adata)
adata.raw = adata # 儲(chǔ)存標(biāo)準(zhǔn)化后的AnnaData Object
識(shí)別差異表達(dá)基因
sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.5)
sc.pl.highly_variable_genes(adata)
將保守的基因去除流强,留下差異表達(dá)的基因用于后續(xù)分析。
adata = adata[:, adata.var['highly_variable']]
adata
View of AnnData object with n_obs × n_vars = 328435 × 1372
obs: 'n_genes', 'percent_mito', 'n_counts'
var: 'gene_ids', 'n_cells', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
328435個(gè)細(xì)胞呻待,1372個(gè)基因打月。
回歸每個(gè)細(xì)胞總計(jì)數(shù)和線粒體基因表達(dá)百分比的影響。將數(shù)據(jù)放縮到方差為1蚕捉。單細(xì)胞數(shù)據(jù)集可能包含“不感興趣”的變異來(lái)源奏篙。這不僅包括技術(shù)噪音,還包括批次效應(yīng),甚至包括生物變異來(lái)源(細(xì)胞周期階段)秘通。正如(Buettner, et al NBT为严,2015)中所建議的那樣,從分析中回歸這些信號(hào)可以改善下游維數(shù)減少和聚類肺稀。
這一步高內(nèi)存需求預(yù)警第股,建議清理電腦緩存,關(guān)閉后臺(tái)不使用的應(yīng)用话原。
sc.pp.regress_out(adata, ['n_counts', 'percent_mito'])
/Users/shinianyike/anaconda3/lib/python3.6/site-packages/statsmodels/compat/pandas.py:56: FutureWarning: The pandas.core.datetools module is deprecated and will be removed in a future version. Please use the pandas.tseries module instead.
from pandas.core import datetools
Scale each gene to unit variance. Clip values exceeding standard deviation 10.
sc.pp.scale(adata, max_value=10)
Step2, 主成分分析
主成分分析是一種將數(shù)據(jù)降維的分析方法夕吻,是考察多個(gè)變量間相關(guān)性一種多元統(tǒng)計(jì)方法,研究如何通過少數(shù)幾個(gè)主成分來(lái)揭示多個(gè)變量間的內(nèi)部結(jié)構(gòu)稿静,即從原始變量中導(dǎo)出少數(shù)幾個(gè)主成分梭冠,使它們盡可能多地保留原始變量的信息辕狰,且彼此間互不相關(guān).通常數(shù)學(xué)上的處理就是將原來(lái)P個(gè)指標(biāo)作線性組合改备,作為新的綜合指標(biāo)。
sc.tl.pca(adata, svd_solver='arpack') # PCA分析
sc.pl.pca(adata, color='CST3') #繪圖
作碎石圖蔓倍,觀測(cè)主成分的質(zhì)量悬钳。這個(gè)圖用于選擇后續(xù)應(yīng)該使用多少個(gè)PC,用于計(jì)算細(xì)胞間的相鄰距離偶翅。
sc.pl.pca_variance_ratio(adata, log=True)
在這里先將數(shù)據(jù)保存默勾,便于后續(xù)操作的更改。
adata.write("pca_results.h5ad")
Step3医咨, 聚類分析
計(jì)算細(xì)胞間的距離:
這里的參數(shù)就先按照默認(rèn)值設(shè)定:
sc.pp.neighbors(adata, n_neighbors=10, n_pcs=40)
參數(shù)說(shuō)明:
n_neighbors指的是每個(gè)點(diǎn)的鄰近點(diǎn)的數(shù)量拾徙,據(jù)評(píng)論區(qū)@小光amateur 所說(shuō)neighbors的個(gè)數(shù)越多苇羡,聚類數(shù)會(huì)越少。
pc的數(shù)量依賴于上面所做的碎石圖环疼,一般是選在拐點(diǎn)處的的主成分,只需要一個(gè)粗略值朵耕,不同的pc數(shù)量所產(chǎn)生的聚類形狀也不同炫隶。我后來(lái)更改為PC=16,效果比下圖要好一些阎曹。
將距離嵌入圖中:
sc.tl.umap(adata)
sc.pl.umap(adata, color=['CST3', 'NKG7', 'PPBP'])
sc.pl.umap(adata, color=['CST3', 'NKG7', 'PPBP'], use_raw=False)
聚類
sc.tl.louvain(adata)
sc.pl.umap(adata, color=['louvain'])
這里得到了29類細(xì)胞伪阶,每個(gè)顏色代表一種。
將數(shù)據(jù)保存处嫌。
adata.write("umap.h5ad")
尋找marker基因
marker基因通常是細(xì)胞表面抗原栅贴,用于定義出該細(xì)胞的細(xì)胞類型。
為了定義每個(gè)簇屬于什么細(xì)胞熏迹,根據(jù)基因的差異表達(dá)水平檐薯,將每個(gè)簇排名前25的基因?qū)С觥?/p>
sc.tl.rank_genes_groups(adata, 'louvain', method='wilcoxon')
sc.pl.rank_genes_groups(adata, n_genes=25, sharey=False)
Preprosessing the data
import numpy as np
import pandas as pd
import matplotlib.pyplot as pl
from matplotlib import rcParams
import scanpy as sc
sc.settings.verbosity = 3 # verbosity: errors (0), warnings (1), info (2), hints (3)
sc.logging.print_versions()
scanpy==1.4 anndata==0.6.19 numpy==1.14.5 scipy==1.1.0 pandas==0.23.4 scikit-learn==0.19.2 statsmodels==0.9.0 python-igraph==0.7.1 louvain==0.6.1
adata = sc.read_h5ad("/bone_marrow/scanpy/3_29_PC16_filterMore/umap_tsne_3_29.h5ad")
sc.tl.draw_graph(adata)
drawing single-cell graph using layout "fa"
finished (1:06:05.80) --> added
'X_draw_graph_fa', graph_drawing coordinates (adata.obsm)
sc.pl.draw_graph(adata, color='louvain', legend_loc='on data',title = "")
Denoising the graph(will skip it next time!)
sc.tl.diffmap(adata)
sc.pp.neighbors(adata, n_neighbors=10, use_rep='X_diffmap')
computing Diffusion Maps using n_comps=15(=n_dcs)
eigenvalues of transition matrix
[1. 0.99998933 0.9999825 0.9999806 0.9999773 0.99997413
0.99997026 0.999969 0.99996084 0.9999516 0.9999409 0.9999385
0.9999321 0.99992156 0.9999118 ]
finished (0:11:57.51) --> added
'X_diffmap', diffmap coordinates (adata.obsm)
'diffmap_evals', eigenvalues of transition matrix (adata.uns)
computing neighbors
finished (0:01:30.84) --> added to `.uns['neighbors']`
'distances', distances for each pair of neighbors
'connectivities', weighted adjacency matrix
sc.tl.draw_graph(adata)
drawing single-cell graph using layout "fa"
finished (1:05:24.51) --> added
'X_draw_graph_fa', graph_drawing coordinates (adata.obsm)
sc.pl.draw_graph(adata, color='louvain', legend_loc='on data',title = "")
..didn't see any denoising effect
PAGA
Annotate the clusters using marker genes.
sc.tl.paga(adata, groups='louvain')
running PAGA
finished (0:00:13.69) --> added
'paga/connectivities', connectivities adjacency (adata.uns)
'paga/connectivities_tree', connectivities subtree (adata.uns)
sc.pl.paga(adata, color=['louvain'],title = "")
--> added 'pos', the PAGA positions (adata.uns['paga'])
sc.pl.paga(adata, color=['CD34', 'GYPB', 'MS4A1', 'IL7R'])
--> added 'pos', the PAGA positions (adata.uns['paga'])
Annote groups with cell type
adata.obs['louvain'].cat.categories
Index(['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12',
'13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23'],
dtype='object')
adata.obs['louvain_anno'] = adata.obs['louvain']
# annote them with names
adata.obs['louvain_anno'].cat.categories = ['0/T', '1/B', '2', '3/T', '4/MDDC', '5', '6/MDDC', '7/NK', '8/MDDC', '9/CD8+T', '10/NK', '11/B', '12/NRBC',
'13', '14/CD1C-CD141-DC', '15/pDC', '16/Macro,DC', '17/pDC', '18/DC', '19/transB,Plasmab', '20','21','22/B,NK','23']
Cluster | Cell Type | Marker Gene |
---|---|---|
0 | T cell/IL-17Ralpha T cell | IL7R, CD3E, CD3D |
1 | B cell | MS4A1, CD79A |
2 | 高表達(dá)核糖體蛋白基因 | |
3 | CD8+ T cell, T helper, angiogenic T cell | CD3E, CXCR4, CD3D, CCL5, GZMK |
4 | Monocyte derived dendritic cell | S100A8, S100A9 |
5 | 高表達(dá)核糖體蛋白基因 | |
6 | Monocyte derived dendritic cell | S100A8, S100A9 |
7 | NK cell | PRF1, NKG7, KLRB1, KLRD1 |
8 | Monocyte derived dendritic cell | S100A8, S100A9 |
9 | CD8+ T cell | GZMK, CD3D, CD8A, NKG7 * |
10 | NK Cell | GNLY, NKG7, PTPRC |
11 | B cell | CD24, CD79A, CD37, CD79B |
12 | Red blood cell(Erythrocyte) | HBB, HBA1,GYPA |
13 | not known | |
14 | CD1C-CD141- dendritic cell | FCGR3A, CST3 |
15 | Plasmacytiod dendritic cell | HSP90B1, SSR4, PDIA4, SEC11C, MZB1, UBE2J1, FKBP2, DERL3, HERPUD1, ITM2C |
16 | Macrophage/ dendritic cell | LYZ, HLA-DQA1, AIF1, CD74, FCER1A, CST3 |
17 | Plasmacytiod dendritic cell | IRF8, TCF4, LILRA4 * |
18 | Megakaryocyte progenitor cell/Megakaryocyte | PF4, PPBP, / GP9 |
19 | transitional B cell / Plasmablast | CD24, CD79B |
20 | not known | |
21 | B cell | MS4A1, CD79A, CD37, CD74 |
22 | B cell , NK cell | CD74,CD79A, NKG7, GZMH |
23 | not known |
上面這個(gè)大家看看就好癣缅,我自己也不確定厨剪,請(qǐng)自行翻閱文獻(xiàn):逶汀!祷膳!
sc.tl.paga(adata, groups='louvain_anno')
running PAGA
finished (0:00:13.55) --> added
'paga/connectivities', connectivities adjacency (adata.uns)
'paga/connectivities_tree', connectivities subtree (adata.uns)
sc.pl.paga(adata, threshold=0.03)
--> added 'pos', the PAGA positions (adata.uns['paga'])
adata
AnnData object with n_obs × n_vars = 315509 × 1314
obs: 'n_genes', 'percent_mito', 'n_counts', 'louvain', 'louvain_anno'
var: 'gene_ids', 'n_cells', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
uns: 'louvain', 'louvain_colors', 'neighbors', 'pca', 'draw_graph', 'diffmap_evals', 'paga', 'louvain_sizes', 'louvain_anno_sizes', 'louvain_anno_colors'
obsm: 'X_pca', 'X_umap', 'X_tsne', 'X_draw_graph_fa', 'X_diffmap'
varm: 'PCs'
sc.tl.draw_graph(adata, init_pos='paga')
drawing single-cell graph using layout "fa"
finished (1:03:55.77) --> added
'X_draw_graph_fa', graph_drawing coordinates (adata.obsm)
Add pesudotime parameters
# the most primitive cell is refered as 0 persudotime.
# Group 13 is the nearest cell population to Hematopoietic stem cell.
adata.uns['iroot'] = np.flatnonzero(adata.obs['louvain_anno'] == '13')[0]
sc.tl.dpt(adata)
computing Diffusion Pseudotime using n_dcs=10
finished (0:00:00.04) --> added
'dpt_pseudotime', the pseudotime (adata.obs)
sc.pl.draw_graph(adata, color=['louvain_anno', 'dpt_pseudotime'],
legend_loc='right margin',title = ['','pseudotime'])
sc.pl.draw_graph(adata, color=['louvain_anno'],
legend_loc='right margin',title = ['']) #plot again to see full legends info
try other "iroot" setting
adata.uns['iroot'] = np.flatnonzero(adata.obs['louvain_anno'] == '5')[0]
sc.tl.dpt(adata)
sc.pl.draw_graph(adata, color=['louvain_anno', 'dpt_pseudotime'],
legend_loc='right margin',title = ['','pseudotime'])
computing Diffusion Pseudotime using n_dcs=10
finished (0:00:00.04) --> added
'dpt_pseudotime', the pseudotime (adata.obs)
Several other cell types are chosen to be "root" for diffusion pseudotime, however the pseudotime graphs look no big different.
..it doesn't look meaningful. didn't see any trajectory to describe cell development.
I think the "denoising graph" step is to blame. Will skip it next time.
Otherwise i should zoom it into a specific cell population, but have no idea which kind of cell i should choose...
Beautify the graphs
Choose the colors of the clusters a bit more consistently.
pl.figure(figsize=(8, 2))
for i in range(28):
pl.scatter(i, 1, c=sc.pl.palettes.zeileis_26[i], s=200)
pl.show()
zeileis_colors = np.array(sc.pl.palettes.zeileis_26)
new_colors = np.array(adata.uns['louvain_anno_colors'])
new_colors[[13]] = zeileis_colors[[12]] # Stem(?) colors / green
new_colors[[12]] = zeileis_colors[[5]] # Ery colors / red
new_colors[[4,6,8,15,17]] = zeileis_colors[[17,17,17,16,16]] # monocyte derived dendritic cell and pDC/ yellow
new_colors[[14,16,18]] = zeileis_colors[[16,16,16]] # DC / yellow
new_colors[[0,3,9]] = zeileis_colors[[6,6,6]] # T cell / light blue
new_colors[[7,10]] = zeileis_colors[[0,0]] # NK cell / dark blue
new_colors[[1,11,22,19]] = zeileis_colors[[22,22,22,21]] # B cell / pink
new_colors[[21,23,20]] = zeileis_colors[[25,25,25]] # Not known / grey
new_colors[[2, 5]] = zeileis_colors[[25, 25]] # outliers / grey
adata.uns['louvain_anno_colors'] = new_colors
adata
AnnData object with n_obs × n_vars = 315509 × 1314
obs: 'n_genes', 'percent_mito', 'n_counts', 'louvain', 'louvain_anno', 'dpt_pseudotime'
var: 'gene_ids', 'n_cells', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
uns: 'louvain', 'louvain_colors', 'neighbors', 'pca', 'draw_graph', 'diffmap_evals', 'paga', 'louvain_sizes', 'louvain_anno_sizes', 'louvain_anno_colors', 'iroot'
obsm: 'X_pca', 'X_umap', 'X_tsne', 'X_draw_graph_fa', 'X_diffmap'
varm: 'PCs'
sc.pl.draw_graph(adata, color=['louvain_anno'],
legend_loc='right margin',title = [''])
this is a piece of shit.(翻車了 陶衅,哈哈)
<article class="_2rhmJa" style="box-sizing: border-box; display: block; font-weight: 400; line-height: 1.8; margin-bottom: 20px; color: rgb(64, 64, 64); font-family: -apple-system, BlinkMacSystemFont, "Apple Color Emoji", "Segoe UI Emoji", "Segoe UI Symbol", "Segoe UI", "PingFang SC", "Hiragino Sans GB", "Microsoft YaHei", "Helvetica Neue", Helvetica, Arial, sans-serif; font-size: 16px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">
與上次翻車實(shí)驗(yàn)的不同:
- 過濾了更多的細(xì)胞和基因。在處理數(shù)據(jù)時(shí)直晨,低質(zhì)量的細(xì)胞一定要清除掉搀军,告誡大家寧缺毋濫。勇皇。罩句。否則后續(xù)分析真的是一堆的噪點(diǎn)。
- 跳過了scanpy中的降噪步驟敛摘。scanpy的降噪算法做的真不咋地门烂,一個(gè)好圖降噪之后更加混亂了。這個(gè)故事告訴我們一個(gè)道理:不要迷信作者說(shuō)的話P忠M驮丁!
對(duì)翻車實(shí)驗(yàn)感興趣的話:http://www.reibang.com/p/e688646a451b
什么是軌跡推斷/偽時(shí)間分析捕虽?
軌跡推斷對(duì)應(yīng)英文是trajectory inference慨丐,偽時(shí)間分析對(duì)應(yīng)英文是pseudotime analysis。其實(shí)它們做的是同一件事泄私。
在單細(xì)胞轉(zhuǎn)錄組測(cè)序完成的時(shí)候房揭,無(wú)論是細(xì)胞在發(fā)育、分化還是凋亡的過程中晌端,它們的生命活動(dòng)已經(jīng)在這一時(shí)刻靜止了捅暴,因此我們可以將這些測(cè)序數(shù)據(jù)看作是這一個(gè)時(shí)刻細(xì)胞呈現(xiàn)的表達(dá)狀態(tài)的瞬時(shí)截圖。
而軌跡推斷/偽時(shí)間分析則構(gòu)建了一個(gè)細(xì)胞發(fā)育和分化的模型斩松。這一個(gè)瞬時(shí)截圖內(nèi)的細(xì)胞在各種各樣不同的發(fā)育狀態(tài)伶唯,有的發(fā)育早,有的發(fā)育晚惧盹,有的分化了乳幸,有的未分化,甚至有的處于中間態(tài)钧椰。而決定細(xì)胞分化往往就是基因表達(dá)的變化粹断。
軌跡推斷利用各類細(xì)胞基因表達(dá)的連續(xù)性建立了一個(gè)“分化軌跡”,再將這些細(xì)胞投射到這個(gè)“分化軌跡”上嫡霞。如此一來(lái)瓶埋,這個(gè)靜態(tài)的截圖就呈現(xiàn)出了動(dòng)態(tài)的過程。
PS:細(xì)胞類型分類真是一場(chǎng)艱苦卓絕的戰(zhàn)斗!Q病曾撤!一入文獻(xiàn)深似海!T畏唷<废ぁ!
0. 數(shù)據(jù)預(yù)處理
先加載需要的包:
import numpy as np
import pandas as pd
import matplotlib.pyplot as pl
from matplotlib import rcParams
import scanpy as sc
sc.settings.verbosity = 3 # verbosity: errors (0), warnings (1), info (2), hints (3)
sc.logging.print_versions()
版本信息:
scanpy==1.4 anndata==0.6.19 numpy==1.14.5 scipy==1.1.0 pandas==0.23.4 scikit-learn==0.19.2 statsmodels==0.9.0 python-igraph==0.7.1 louvain==0.6.1
讀取數(shù)據(jù)(數(shù)據(jù)來(lái)源詳見http://www.reibang.com/p/b190efae4d31)
這個(gè)數(shù)據(jù)是利用scanpy進(jìn)行聚類后的對(duì)象巫湘。
adata = sc.read_h5ad("/data1/zll/project/deepBase3/HCA/bone_marrow/scanpy/3_29_PC16_filterMore/umap_tsne_3_29.h5ad")
預(yù)處理數(shù)據(jù)装悲,計(jì)算距離并可視化。
sc.tl.draw_graph(adata)
sc.pl.draw_graph(adata, color='louvain', legend_loc='on data',title = "")
1. Partition-based Graph Abstraction(PAGA)
這是一種基于空間劃分的抽提細(xì)胞分化“骨架”的一種算法尚氛,用于顯示細(xì)胞的分化軌跡诀诊,得到哪些cluster之間的關(guān)系更接近。
sc.tl.paga(adata, groups='louvain')
sc.pl.paga(adata, color=['louvain'],title = "")
AVP是造血干細(xì)胞表達(dá)的一個(gè)基因阅嘶,給這個(gè)基因上色属瓣,我們可以看到基本上都富集在了cluster13的位置。那么有很大可能這個(gè)cluster就是造血干細(xì)胞奈懒。
sc.pl.paga(adata, color=['AVP'])
2.注釋細(xì)胞類型
上面的每個(gè)cluster都是用數(shù)字編號(hào)代替奠涌,為了能更清楚地知道這些細(xì)胞之間的關(guān)系宪巨,將細(xì)胞類型的信息注釋上去磷杏。細(xì)胞類型的確定主要利用了marker基因以及現(xiàn)有的文獻(xiàn)和資料,是一個(gè)非常復(fù)雜的過程捏卓〖觯總之在與文獻(xiàn)的一番斗爭(zhēng)之后,找到了以下這些骨髓的細(xì)胞類型(僅供參考):
Cluster | Cell Type | Marker Gene |
---|---|---|
0 | naive T cell | CD3E, CD3D, RPS29 |
1 | B cell | MS4A1, CD79A |
2 | naive T cell | RPL34,RPS6 |
3 | CD8+ T cell | CCL5, GZMK, IL32 |
4 | Neutrophil | S100A8, S100A9,VCAN |
5 | naive T cell | RPL32,RPL13 |
6 | Neutrophil | FTL, S100A8, S100A9 |
7 | NK cell | PRF1, NKG7, KLRB1, KLRD1 |
8 | Eosinophil | LYZ, LGALS1,MT-CO3 |
9 | CD8+ T cell | GZMK, CD3D, CD8A, NKG7 |
10 | Eosinophil | MT-CO2, MT-CYB,MT-CO3 |
11 | CD34+ pre-B | CD79B, VPREB3 |
12 | Nuceated Red blood cell(Erythroblast) | HBB, HBA1,AHSP |
13 | CD34+ CMP,CD34+ HSC,Early-ERP | HNRNPA1, BTF3,GNB2L1,NPM1 |
14 | Monocyte | FCGR3A, LST1, AIF1 |
15 | Plasma Cell | MZB1, SSR4, FKBP11 |
16 | pre-dendritic cell | CST3, HLA-DPA1,FCER1A |
17 | dendritic cell | ITM2C, SEC61B,JCHAIN |
18 | Platelet | PF4, PPBP, GP9 |
19 | pro-B cell | HMGB1, IGLL1,STMN1 |
20 | Stromal cell | AOX1, SEPP1 |
21 | Eosinophil | MT-CO2, MT-ND3,MT-CO3, |
22 | B cell , NK cell (mixture,not sure) | CD74,CD79A, NKG7, GZMH |
23 | pro-B | STMN1, TUBA1B |
下面進(jìn)行注釋:
adata.obs['louvain'].cat.categories
返回共有24個(gè)cluster:
Index(['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12',
'13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23'],
dtype='object')
將編號(hào)所對(duì)應(yīng)的細(xì)胞類型注釋上去:
adata.obs['louvain_anno'] = adata.obs['louvain']
# annote them with names
adata.obs['louvain_anno'].cat.categories = ['0/nT', '1/B', '2nT', '3/CD8+T',
'4/Neu', '5/nT', '6/Neu', '7/NK',
'8/Eos', '9/CD8+T', '10/Eos', '11/pre-B',
'12/NRBC','13/HSC', '14/Mon', '15/plasma',
'16/preDC', '17/DC', '18/plt', '19/pro-B',
'20/strom','21/Eos','22/','23/pro-B']
3. 計(jì)算 PAGA并可視化
sc.tl.paga(adata, groups='louvain_anno')
sc.pl.paga(adata, threshold=0.03)
于是就得到了這樣像“骨架”一樣的結(jié)果怠晴。這個(gè)圖顯示的就是細(xì)胞與細(xì)胞之間的關(guān)系遥金,距離越近表示關(guān)系越接近。
4. 利用PAGA重新計(jì)算細(xì)胞之間的距離
還記得我們第0步計(jì)算的距離嗎蒜田?現(xiàn)在我們要將細(xì)胞在PAGA這個(gè)“骨架”上重現(xiàn)出來(lái)稿械,就利用PAGA的計(jì)算結(jié)果,把細(xì)胞放上去冲粤。
sc.tl.draw_graph(adata, init_pos='paga')
sc.pl.draw_graph(adata, color=['louvain_anno'], legend_loc='right margin')
5.(可選)美化圖片
Choose the colors of the clusters a bit more consistently.
pl.figure(figsize=(8, 2))
for i in range(28):
pl.scatter(i, 1, c=sc.pl.palettes.zeileis_26[i], s=200)
pl.show()
zeileis_colors = np.array(sc.pl.palettes.zeileis_26)
new_colors = np.array(adata.uns['louvain_anno_colors'])
new_colors[[13]] = zeileis_colors[[17]] # Stem colors / yellow
new_colors[[12]] = zeileis_colors[[5]] # Ery colors / red
new_colors[[4,6]] = zeileis_colors[[12,12]] # Neutrophil / green
new_colors[[8,10]] = zeileis_colors[[11,11]] # Eosinophil / light red
new_colors[[14]] = zeileis_colors[[19]] # monocyte / light green
new_colors[[16,17]] = zeileis_colors[[18,26]] # DC / yellow
new_colors[[0,2,3,5,9]] = zeileis_colors[[7,7,6,7,6]] # naive T & CD8+T / light blue
new_colors[[7]] = zeileis_colors[[0]] # NK cell / dark blue
new_colors[[1,11,19,23]] = zeileis_colors[[23,22,9,9]] # B cell / pink
new_colors[[22]] = zeileis_colors[[25]] # Not known / grey
new_colors[[15]] = zeileis_colors[[3]] #plasma cell / blue grey
new_colors[[18]] = zeileis_colors[[4]] #platelet / pink grey
adata.uns['louvain_anno_colors'] = new_colors
And add some white space to some cluster names.
sc.pl.paga_compare(
adata, threshold=0.03, title='', right_margin=0.2, size=10, edge_width_scale=0.5,
legend_fontsize=12, fontsize=12, frameon=False, edges=True, save=True)
--> added 'pos', the PAGA positions (adata.uns['paga'])
saving figure to file ./figures/paga_compare.pdf
6. 計(jì)算偽時(shí)間值
首先需要定義一個(gè)偽時(shí)間值為0的細(xì)胞群體美莫,一般來(lái)說(shuō)就是干細(xì)胞或者祖細(xì)胞√莶叮總之就是最原始的細(xì)胞類型厢呵。
# the most primitive cell is refered as 0 persudotime.
# Group 13 is the nearest cell population to Hematopoietic stem cell.
adata.uns['iroot'] = np.flatnonzero(adata.obs['louvain_anno'] == '13/HSC')[0]
sc.tl.dpt(adata)
sc.pl.draw_graph(adata, color=['louvain_anno', 'dpt_pseudotime'],
legend_loc='right margin',title = ['','pseudotime'])
偽時(shí)間圖的分化軌跡無(wú)法用色彩體現(xiàn)出來(lái),可能是偽時(shí)間參數(shù)的問題傀顾。
上面的圖注被遮住了襟铭,所以單獨(dú)做一個(gè)圖。
#plot again to see full legends info
sc.pl.draw_graph(adata, color=['louvain_anno'],
legend_loc='right margin',title = [''])
保存結(jié)果文件:
adata.write("/data1/zll/project/deepBase3/HCA/bone_marrow/scanpy/3_31_traj/trajectory_3_31.h5ad")
關(guān)于scanpy如何進(jìn)行多樣本整合數(shù)據(jù)分析,還望道友們不吝賜教寒砖!