scanpy分析單細(xì)胞數(shù)據(jù)

R在讀取和處理數(shù)據(jù)的過程中會(huì)將所有的變量和占用都儲(chǔ)存在RAM當(dāng)中摹蘑,這樣一來(lái)惊畏,對(duì)于海量的單細(xì)胞RNA-seq數(shù)據(jù)(尤其是超過250k的細(xì)胞量)舀透,即使在服務(wù)器當(dāng)中運(yùn)行栓票,Seurat、metacell愕够、monocle這一類的R包的使用還是會(huì)產(chǎn)生內(nèi)存不足的問題走贪。

但是最近我發(fā)現(xiàn)了一個(gè)基于python的單細(xì)胞基因表達(dá)分析包scanpy,能夠很好地在我這個(gè)僅4G內(nèi)存的小破機(jī)上分析378k的細(xì)胞惑芭,并且功能豐富程度不亞于Seurat坠狡。它包含了數(shù)據(jù)預(yù)處理、可視化遂跟、聚類擦秽、偽時(shí)間分析和軌跡推斷、差異表達(dá)分析漩勤。根據(jù)官網(wǎng)描述感挥,scanpy能夠有效處理超過1,000,000個(gè)細(xì)胞的數(shù)據(jù)集。

Scanpy – Single-Cell Analysis in Python:https://scanpy.readthedocs.io/en/latest/


安裝與數(shù)據(jù)下載

安裝好python3之后越败,終端運(yùn)行:

pip install scanpy

若安裝過程出現(xiàn)問題触幼,請(qǐng)參考:
https://scanpy.readthedocs.io/en/latest/installation.html

骨髓單細(xì)胞轉(zhuǎn)錄組測(cè)序數(shù)據(jù)下載地址:
https://preview.data.humancellatlas.org


Step0, 讀取數(shù)據(jù)

運(yùn)行python

import numpy as np
import pandas as pd
import scanpy as sc
# 可以直接讀取10Xgenomics的.h5格式數(shù)據(jù)
adata = sc.read_10x_h5("/Users/shinianyike/Desktop/ica_bone_marrow_h5.h5"
               , genome=None, gex_only=True)
adata.var_names_make_unique()

查看數(shù)據(jù):

adata

    AnnData object with n_obs × n_vars = 378000 × 33694 
        var: 'gene_ids'

共378000個(gè)細(xì)胞,33694個(gè)基因究飞。


Step1, 數(shù)據(jù)預(yù)處理

這一步目的將數(shù)據(jù)進(jìn)行篩選和過濾置谦,一些測(cè)序質(zhì)量差的數(shù)據(jù)會(huì)讓后續(xù)的分析產(chǎn)生噪音和干擾堂鲤,因此我們需要將它們?nèi)コ?/p>

展示在所有的細(xì)胞當(dāng)中表達(dá)占比最高的20個(gè)基因。

sc.pl.highest_expr_genes(adata, n_top=20)

image

基礎(chǔ)過濾:
去除表達(dá)基因200以下的細(xì)胞媒峡;
去除在3個(gè)細(xì)胞以下表達(dá)的基因瘟栖。

sc.pp.filter_cells(adata, min_genes=200)   # 去除表達(dá)基因200以下的細(xì)胞
sc.pp.filter_genes(adata, min_cells=3)     # 去除在3個(gè)細(xì)胞以下表達(dá)的基因

adata

    AnnData object with n_obs × n_vars = 335618 × 24888 
        obs: 'n_genes'
        var: 'gene_ids', 'n_cells'

共335618個(gè)細(xì)胞,24888個(gè)基因谅阿。(可以發(fā)現(xiàn)細(xì)胞跟基因數(shù)量都減少了)

質(zhì)量控制:

在測(cè)序后半哟,有很大比例是低質(zhì)量的細(xì)胞,可能是因?yàn)榧?xì)胞破碎造成的胞質(zhì)RNA流失签餐,由于線粒體比單個(gè)的轉(zhuǎn)錄分子要大得多寓涨,不容易在破碎的細(xì)胞膜中泄漏出去,這樣一來(lái)就導(dǎo)致測(cè)序結(jié)果顯示線粒體基因的比例在細(xì)胞內(nèi)占比異常高氯檐。這一步質(zhì)量控制的目的就是將這些低質(zhì)量的細(xì)胞去除掉戒良。

計(jì)算線粒體基因占所有基因的比例:

mito_genes = adata.var_names.str.startswith('MT-')
# for each cell compute fraction of counts in mito genes vs. all genes
# the `.A1` is only necessary as X is sparse (to transform to a dense array after summing)
adata.obs['percent_mito'] = np.sum(
    adata[:, mito_genes].X, axis=1).A1 / np.sum(adata.X, axis=1).A1
# add the total counts per cell as observations-annotation to adata
adata.obs['n_counts'] = adata.X.sum(axis=1).A1

作小提琴圖,查看線粒體基因占比分布:

sc.pl.violin(adata, ['n_genes', 'n_counts', 'percent_mito'],
             jitter=0.4, multi_panel=True)

image

由于數(shù)據(jù)點(diǎn)實(shí)在太多冠摄,小提琴已被數(shù)據(jù)點(diǎn)覆蓋糯崎,無(wú)法顯示出來(lái)。

這里還可以做一個(gè)散點(diǎn)圖河泳,也可以直觀地顯示出一些異常分布的數(shù)據(jù)點(diǎn)沃呢。

sc.pl.scatter(adata, x='n_counts', y='percent_mito')
sc.pl.scatter(adata, x='n_counts', y='n_genes')

image
image
adata

    AnnData object with n_obs × n_vars = 335618 × 24888 
        obs: 'n_genes', 'percent_mito', 'n_counts'
        var: 'gene_ids', 'n_cells'

335618個(gè)細(xì)胞,24888個(gè)基因乔询;
下面這一步進(jìn)行真正的過濾樟插,根據(jù)上面做的圖,去除基因數(shù)目過多(大于等于4000)和線粒體基因占比過多(大于等于0.3)的細(xì)胞:

adata = adata[adata.obs['n_genes'] < 4000, :]
adata = adata[adata.obs['percent_mito'] < 0.3, :]

過濾后查看剩下多少細(xì)胞和基因竿刁。

adata

    View of AnnData object with n_obs × n_vars = 328435 × 24888 
        obs: 'n_genes', 'percent_mito', 'n_counts'
        var: 'gene_ids', 'n_cells'

328435個(gè)細(xì)胞黄锤,24888個(gè)基因。

數(shù)據(jù)標(biāo)準(zhǔn)化

在繪圖之前食拜,還要進(jìn)行一步數(shù)據(jù)標(biāo)準(zhǔn)化鸵熟,將表達(dá)量用對(duì)數(shù)計(jì)算一遍,這樣有利于繪圖和展示负甸。

sc.pp.normalize_per_cell(adata, counts_per_cell_after=1e4)
sc.pp.log1p(adata)

adata.raw = adata # 儲(chǔ)存標(biāo)準(zhǔn)化后的AnnaData Object

識(shí)別差異表達(dá)基因

sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.5)

sc.pl.highly_variable_genes(adata)

image

將保守的基因去除流强,留下差異表達(dá)的基因用于后續(xù)分析。

adata = adata[:, adata.var['highly_variable']]

adata

    View of AnnData object with n_obs × n_vars = 328435 × 1372 
        obs: 'n_genes', 'percent_mito', 'n_counts'
        var: 'gene_ids', 'n_cells', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'

328435個(gè)細(xì)胞呻待,1372個(gè)基因打月。

回歸每個(gè)細(xì)胞總計(jì)數(shù)和線粒體基因表達(dá)百分比的影響。將數(shù)據(jù)放縮到方差為1蚕捉。單細(xì)胞數(shù)據(jù)集可能包含“不感興趣”的變異來(lái)源奏篙。這不僅包括技術(shù)噪音,還包括批次效應(yīng),甚至包括生物變異來(lái)源(細(xì)胞周期階段)秘通。正如(Buettner, et al NBT为严,2015)中所建議的那樣,從分析中回歸這些信號(hào)可以改善下游維數(shù)減少和聚類肺稀。
這一步高內(nèi)存需求預(yù)警第股,建議清理電腦緩存,關(guān)閉后臺(tái)不使用的應(yīng)用话原。

sc.pp.regress_out(adata, ['n_counts', 'percent_mito'])

    /Users/shinianyike/anaconda3/lib/python3.6/site-packages/statsmodels/compat/pandas.py:56: FutureWarning: The pandas.core.datetools module is deprecated and will be removed in a future version. Please use the pandas.tseries module instead.
      from pandas.core import datetools
Scale each gene to unit variance. Clip values exceeding standard deviation 10.

sc.pp.scale(adata, max_value=10)

Step2, 主成分分析

主成分分析是一種將數(shù)據(jù)降維的分析方法夕吻,是考察多個(gè)變量間相關(guān)性一種多元統(tǒng)計(jì)方法,研究如何通過少數(shù)幾個(gè)主成分來(lái)揭示多個(gè)變量間的內(nèi)部結(jié)構(gòu)稿静,即從原始變量中導(dǎo)出少數(shù)幾個(gè)主成分梭冠,使它們盡可能多地保留原始變量的信息辕狰,且彼此間互不相關(guān).通常數(shù)學(xué)上的處理就是將原來(lái)P個(gè)指標(biāo)作線性組合改备,作為新的綜合指標(biāo)。

sc.tl.pca(adata, svd_solver='arpack') # PCA分析
sc.pl.pca(adata, color='CST3') #繪圖

image

作碎石圖蔓倍,觀測(cè)主成分的質(zhì)量悬钳。這個(gè)圖用于選擇后續(xù)應(yīng)該使用多少個(gè)PC,用于計(jì)算細(xì)胞間的相鄰距離偶翅。

sc.pl.pca_variance_ratio(adata, log=True)

image

在這里先將數(shù)據(jù)保存默勾,便于后續(xù)操作的更改。

adata.write("pca_results.h5ad")


Step3医咨, 聚類分析

計(jì)算細(xì)胞間的距離:
這里的參數(shù)就先按照默認(rèn)值設(shè)定:

sc.pp.neighbors(adata, n_neighbors=10, n_pcs=40)

參數(shù)說(shuō)明:
n_neighbors指的是每個(gè)點(diǎn)的鄰近點(diǎn)的數(shù)量拾徙,據(jù)評(píng)論區(qū)@小光amateur 所說(shuō)neighbors的個(gè)數(shù)越多苇羡,聚類數(shù)會(huì)越少。
pc的數(shù)量依賴于上面所做的碎石圖环疼,一般是選在拐點(diǎn)處的的主成分,只需要一個(gè)粗略值朵耕,不同的pc數(shù)量所產(chǎn)生的聚類形狀也不同炫隶。我后來(lái)更改為PC=16,效果比下圖要好一些阎曹。

將距離嵌入圖中:

sc.tl.umap(adata)
sc.pl.umap(adata, color=['CST3', 'NKG7', 'PPBP'])

image
sc.pl.umap(adata, color=['CST3', 'NKG7', 'PPBP'], use_raw=False)

image

聚類

sc.tl.louvain(adata)

sc.pl.umap(adata, color=['louvain'])

image

這里得到了29類細(xì)胞伪阶,每個(gè)顏色代表一種。
將數(shù)據(jù)保存处嫌。

adata.write("umap.h5ad")

尋找marker基因

marker基因通常是細(xì)胞表面抗原栅贴,用于定義出該細(xì)胞的細(xì)胞類型。
為了定義每個(gè)簇屬于什么細(xì)胞熏迹,根據(jù)基因的差異表達(dá)水平檐薯,將每個(gè)簇排名前25的基因?qū)С觥?/p>

sc.tl.rank_genes_groups(adata, 'louvain', method='wilcoxon')
sc.pl.rank_genes_groups(adata, n_genes=25, sharey=False)

image

Preprosessing the data

import numpy as np
import pandas as pd
import matplotlib.pyplot as pl
from matplotlib import rcParams
import scanpy as sc

sc.settings.verbosity = 3  # verbosity: errors (0), warnings (1), info (2), hints (3)
sc.logging.print_versions()

scanpy==1.4 anndata==0.6.19 numpy==1.14.5 scipy==1.1.0 pandas==0.23.4 scikit-learn==0.19.2 statsmodels==0.9.0 python-igraph==0.7.1 louvain==0.6.1 

adata = sc.read_h5ad("/bone_marrow/scanpy/3_29_PC16_filterMore/umap_tsne_3_29.h5ad")

sc.tl.draw_graph(adata)

drawing single-cell graph using layout "fa"
    finished (1:06:05.80) --> added
    'X_draw_graph_fa', graph_drawing coordinates (adata.obsm)

sc.pl.draw_graph(adata, color='louvain', legend_loc='on data',title = "")

image

Denoising the graph(will skip it next time!)

sc.tl.diffmap(adata)
sc.pp.neighbors(adata, n_neighbors=10, use_rep='X_diffmap')

computing Diffusion Maps using n_comps=15(=n_dcs)
    eigenvalues of transition matrix
    [1.         0.99998933 0.9999825  0.9999806  0.9999773  0.99997413
     0.99997026 0.999969   0.99996084 0.9999516  0.9999409  0.9999385
     0.9999321  0.99992156 0.9999118 ]
    finished (0:11:57.51) --> added
    'X_diffmap', diffmap coordinates (adata.obsm)
    'diffmap_evals', eigenvalues of transition matrix (adata.uns)
computing neighbors
    finished (0:01:30.84) --> added to `.uns['neighbors']`
    'distances', distances for each pair of neighbors
    'connectivities', weighted adjacency matrix

sc.tl.draw_graph(adata)

drawing single-cell graph using layout "fa"
    finished (1:05:24.51) --> added
    'X_draw_graph_fa', graph_drawing coordinates (adata.obsm)

sc.pl.draw_graph(adata, color='louvain', legend_loc='on data',title = "")

image

..didn't see any denoising effect

PAGA

Annotate the clusters using marker genes.

sc.tl.paga(adata, groups='louvain')

running PAGA
    finished (0:00:13.69) --> added
    'paga/connectivities', connectivities adjacency (adata.uns)
    'paga/connectivities_tree', connectivities subtree (adata.uns)

sc.pl.paga(adata, color=['louvain'],title = "")

--> added 'pos', the PAGA positions (adata.uns['paga'])

image
sc.pl.paga(adata, color=['CD34', 'GYPB', 'MS4A1', 'IL7R'])

--> added 'pos', the PAGA positions (adata.uns['paga'])

image

Annote groups with cell type

adata.obs['louvain'].cat.categories

Index(['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12',
       '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23'],
      dtype='object')

adata.obs['louvain_anno'] = adata.obs['louvain']

# annote them with names
adata.obs['louvain_anno'].cat.categories = ['0/T', '1/B', '2', '3/T', '4/MDDC', '5', '6/MDDC', '7/NK', '8/MDDC', '9/CD8+T', '10/NK', '11/B', '12/NRBC',
       '13', '14/CD1C-CD141-DC', '15/pDC', '16/Macro,DC', '17/pDC', '18/DC', '19/transB,Plasmab', '20','21','22/B,NK','23']

Cluster Cell Type Marker Gene
0 T cell/IL-17Ralpha T cell IL7R, CD3E, CD3D
1 B cell MS4A1, CD79A
2 高表達(dá)核糖體蛋白基因
3 CD8+ T cell, T helper, angiogenic T cell CD3E, CXCR4, CD3D, CCL5, GZMK
4 Monocyte derived dendritic cell S100A8, S100A9
5 高表達(dá)核糖體蛋白基因
6 Monocyte derived dendritic cell S100A8, S100A9
7 NK cell PRF1, NKG7, KLRB1, KLRD1
8 Monocyte derived dendritic cell S100A8, S100A9
9 CD8+ T cell GZMK, CD3D, CD8A, NKG7 *
10 NK Cell GNLY, NKG7, PTPRC
11 B cell CD24, CD79A, CD37, CD79B
12 Red blood cell(Erythrocyte) HBB, HBA1,GYPA
13 not known
14 CD1C-CD141- dendritic cell FCGR3A, CST3
15 Plasmacytiod dendritic cell HSP90B1, SSR4, PDIA4, SEC11C, MZB1, UBE2J1, FKBP2, DERL3, HERPUD1, ITM2C
16 Macrophage/ dendritic cell LYZ, HLA-DQA1, AIF1, CD74, FCER1A, CST3
17 Plasmacytiod dendritic cell IRF8, TCF4, LILRA4 *
18 Megakaryocyte progenitor cell/Megakaryocyte PF4, PPBP, / GP9
19 transitional B cell / Plasmablast CD24, CD79B
20 not known
21 B cell MS4A1, CD79A, CD37, CD74
22 B cell , NK cell CD74,CD79A, NKG7, GZMH
23 not known

上面這個(gè)大家看看就好癣缅,我自己也不確定厨剪,請(qǐng)自行翻閱文獻(xiàn):逶汀!祷膳!

sc.tl.paga(adata, groups='louvain_anno')

running PAGA
    finished (0:00:13.55) --> added
    'paga/connectivities', connectivities adjacency (adata.uns)
    'paga/connectivities_tree', connectivities subtree (adata.uns)

sc.pl.paga(adata, threshold=0.03)

--> added 'pos', the PAGA positions (adata.uns['paga'])

image
adata

AnnData object with n_obs × n_vars = 315509 × 1314 
    obs: 'n_genes', 'percent_mito', 'n_counts', 'louvain', 'louvain_anno'
    var: 'gene_ids', 'n_cells', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
    uns: 'louvain', 'louvain_colors', 'neighbors', 'pca', 'draw_graph', 'diffmap_evals', 'paga', 'louvain_sizes', 'louvain_anno_sizes', 'louvain_anno_colors'
    obsm: 'X_pca', 'X_umap', 'X_tsne', 'X_draw_graph_fa', 'X_diffmap'
    varm: 'PCs'

sc.tl.draw_graph(adata, init_pos='paga')

drawing single-cell graph using layout "fa"
    finished (1:03:55.77) --> added
    'X_draw_graph_fa', graph_drawing coordinates (adata.obsm)

Add pesudotime parameters

# the most primitive cell is refered as 0 persudotime.
# Group 13 is the nearest cell population to Hematopoietic stem cell.

adata.uns['iroot'] = np.flatnonzero(adata.obs['louvain_anno']  == '13')[0]
sc.tl.dpt(adata)

computing Diffusion Pseudotime using n_dcs=10
    finished (0:00:00.04) --> added
    'dpt_pseudotime', the pseudotime (adata.obs)

sc.pl.draw_graph(adata, color=['louvain_anno', 'dpt_pseudotime'],
                 legend_loc='right margin',title = ['','pseudotime'])

image
sc.pl.draw_graph(adata, color=['louvain_anno'],
                 legend_loc='right margin',title = ['']) #plot again to see full legends info

image

try other "iroot" setting

adata.uns['iroot'] = np.flatnonzero(adata.obs['louvain_anno']  == '5')[0]
sc.tl.dpt(adata)
sc.pl.draw_graph(adata, color=['louvain_anno', 'dpt_pseudotime'],
                 legend_loc='right margin',title = ['','pseudotime'])

computing Diffusion Pseudotime using n_dcs=10
    finished (0:00:00.04) --> added
    'dpt_pseudotime', the pseudotime (adata.obs)

image

Several other cell types are chosen to be "root" for diffusion pseudotime, however the pseudotime graphs look no big different.


..it doesn't look meaningful. didn't see any trajectory to describe cell development.

I think the "denoising graph" step is to blame. Will skip it next time.
Otherwise i should zoom it into a specific cell population, but have no idea which kind of cell i should choose...

Beautify the graphs

Choose the colors of the clusters a bit more consistently.

pl.figure(figsize=(8, 2))
for i in range(28):
    pl.scatter(i, 1, c=sc.pl.palettes.zeileis_26[i], s=200)
pl.show()

image
zeileis_colors = np.array(sc.pl.palettes.zeileis_26)
new_colors = np.array(adata.uns['louvain_anno_colors'])

new_colors[[13]] = zeileis_colors[[12]]  # Stem(?) colors / green
new_colors[[12]] = zeileis_colors[[5]]  # Ery colors / red
new_colors[[4,6,8,15,17]] = zeileis_colors[[17,17,17,16,16]]  # monocyte derived dendritic cell and pDC/ yellow
new_colors[[14,16,18]] = zeileis_colors[[16,16,16]]  # DC / yellow
new_colors[[0,3,9]] = zeileis_colors[[6,6,6]]  # T cell / light blue
new_colors[[7,10]] = zeileis_colors[[0,0]]  # NK cell / dark blue
new_colors[[1,11,22,19]] = zeileis_colors[[22,22,22,21]]  # B cell / pink
new_colors[[21,23,20]] = zeileis_colors[[25,25,25]]  # Not known / grey
new_colors[[2, 5]] = zeileis_colors[[25, 25]]  # outliers / grey

adata.uns['louvain_anno_colors'] = new_colors

adata

AnnData object with n_obs × n_vars = 315509 × 1314 
    obs: 'n_genes', 'percent_mito', 'n_counts', 'louvain', 'louvain_anno', 'dpt_pseudotime'
    var: 'gene_ids', 'n_cells', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
    uns: 'louvain', 'louvain_colors', 'neighbors', 'pca', 'draw_graph', 'diffmap_evals', 'paga', 'louvain_sizes', 'louvain_anno_sizes', 'louvain_anno_colors', 'iroot'
    obsm: 'X_pca', 'X_umap', 'X_tsne', 'X_draw_graph_fa', 'X_diffmap'
    varm: 'PCs'

sc.pl.draw_graph(adata, color=['louvain_anno'],
                 legend_loc='right margin',title = [''])

image

this is a piece of shit.(翻車了 陶衅,哈哈)
<article class="_2rhmJa" style="box-sizing: border-box; display: block; font-weight: 400; line-height: 1.8; margin-bottom: 20px; color: rgb(64, 64, 64); font-family: -apple-system, BlinkMacSystemFont, "Apple Color Emoji", "Segoe UI Emoji", "Segoe UI Symbol", "Segoe UI", "PingFang SC", "Hiragino Sans GB", "Microsoft YaHei", "Helvetica Neue", Helvetica, Arial, sans-serif; font-size: 16px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">

與上次翻車實(shí)驗(yàn)的不同:
  1. 過濾了更多的細(xì)胞和基因。在處理數(shù)據(jù)時(shí)直晨,低質(zhì)量的細(xì)胞一定要清除掉搀军,告誡大家寧缺毋濫。勇皇。罩句。否則后續(xù)分析真的是一堆的噪點(diǎn)。
  2. 跳過了scanpy中的降噪步驟敛摘。scanpy的降噪算法做的真不咋地门烂,一個(gè)好圖降噪之后更加混亂了。這個(gè)故事告訴我們一個(gè)道理:不要迷信作者說(shuō)的話P忠M驮丁!

對(duì)翻車實(shí)驗(yàn)感興趣的話:http://www.reibang.com/p/e688646a451b


什么是軌跡推斷/偽時(shí)間分析捕虽?

軌跡推斷對(duì)應(yīng)英文是trajectory inference慨丐,偽時(shí)間分析對(duì)應(yīng)英文是pseudotime analysis。其實(shí)它們做的是同一件事泄私。

在單細(xì)胞轉(zhuǎn)錄組測(cè)序完成的時(shí)候房揭,無(wú)論是細(xì)胞在發(fā)育、分化還是凋亡的過程中晌端,它們的生命活動(dòng)已經(jīng)在這一時(shí)刻靜止了捅暴,因此我們可以將這些測(cè)序數(shù)據(jù)看作是這一個(gè)時(shí)刻細(xì)胞呈現(xiàn)的表達(dá)狀態(tài)的瞬時(shí)截圖。

而軌跡推斷/偽時(shí)間分析則構(gòu)建了一個(gè)細(xì)胞發(fā)育和分化的模型斩松。這一個(gè)瞬時(shí)截圖內(nèi)的細(xì)胞在各種各樣不同的發(fā)育狀態(tài)伶唯,有的發(fā)育早,有的發(fā)育晚惧盹,有的分化了乳幸,有的未分化,甚至有的處于中間態(tài)钧椰。而決定細(xì)胞分化往往就是基因表達(dá)的變化粹断。

軌跡推斷利用各類細(xì)胞基因表達(dá)的連續(xù)性建立了一個(gè)“分化軌跡”,再將這些細(xì)胞投射到這個(gè)“分化軌跡”上嫡霞。如此一來(lái)瓶埋,這個(gè)靜態(tài)的截圖就呈現(xiàn)出了動(dòng)態(tài)的過程。


PS:細(xì)胞類型分類真是一場(chǎng)艱苦卓絕的戰(zhàn)斗!Q病曾撤!一入文獻(xiàn)深似海!T畏唷<废ぁ!


0. 數(shù)據(jù)預(yù)處理

先加載需要的包:

import numpy as np
import pandas as pd
import matplotlib.pyplot as pl
from matplotlib import rcParams
import scanpy as sc

sc.settings.verbosity = 3  # verbosity: errors (0), warnings (1), info (2), hints (3)
sc.logging.print_versions()

版本信息:

scanpy==1.4 anndata==0.6.19 numpy==1.14.5 scipy==1.1.0 pandas==0.23.4 scikit-learn==0.19.2 statsmodels==0.9.0 python-igraph==0.7.1 louvain==0.6.1 

讀取數(shù)據(jù)(數(shù)據(jù)來(lái)源詳見http://www.reibang.com/p/b190efae4d31
這個(gè)數(shù)據(jù)是利用scanpy進(jìn)行聚類后的對(duì)象巫湘。

adata = sc.read_h5ad("/data1/zll/project/deepBase3/HCA/bone_marrow/scanpy/3_29_PC16_filterMore/umap_tsne_3_29.h5ad")

預(yù)處理數(shù)據(jù)装悲,計(jì)算距離并可視化。

sc.tl.draw_graph(adata)
sc.pl.draw_graph(adata, color='louvain', legend_loc='on data',title = "")

image

1. Partition-based Graph Abstraction(PAGA)

這是一種基于空間劃分的抽提細(xì)胞分化“骨架”的一種算法尚氛,用于顯示細(xì)胞的分化軌跡诀诊,得到哪些cluster之間的關(guān)系更接近。

sc.tl.paga(adata, groups='louvain')
sc.pl.paga(adata, color=['louvain'],title = "")

image

AVP是造血干細(xì)胞表達(dá)的一個(gè)基因阅嘶,給這個(gè)基因上色属瓣,我們可以看到基本上都富集在了cluster13的位置。那么有很大可能這個(gè)cluster就是造血干細(xì)胞奈懒。

sc.pl.paga(adata, color=['AVP'])

image

2.注釋細(xì)胞類型

上面的每個(gè)cluster都是用數(shù)字編號(hào)代替奠涌,為了能更清楚地知道這些細(xì)胞之間的關(guān)系宪巨,將細(xì)胞類型的信息注釋上去磷杏。細(xì)胞類型的確定主要利用了marker基因以及現(xiàn)有的文獻(xiàn)和資料,是一個(gè)非常復(fù)雜的過程捏卓〖觯總之在與文獻(xiàn)的一番斗爭(zhēng)之后,找到了以下這些骨髓的細(xì)胞類型(僅供參考):

Cluster Cell Type Marker Gene
0 naive T cell CD3E, CD3D, RPS29
1 B cell MS4A1, CD79A
2 naive T cell RPL34,RPS6
3 CD8+ T cell CCL5, GZMK, IL32
4 Neutrophil S100A8, S100A9,VCAN
5 naive T cell RPL32,RPL13
6 Neutrophil FTL, S100A8, S100A9
7 NK cell PRF1, NKG7, KLRB1, KLRD1
8 Eosinophil LYZ, LGALS1,MT-CO3
9 CD8+ T cell GZMK, CD3D, CD8A, NKG7
10 Eosinophil MT-CO2, MT-CYB,MT-CO3
11 CD34+ pre-B CD79B, VPREB3
12 Nuceated Red blood cell(Erythroblast) HBB, HBA1,AHSP
13 CD34+ CMP,CD34+ HSC,Early-ERP HNRNPA1, BTF3,GNB2L1,NPM1
14 Monocyte FCGR3A, LST1, AIF1
15 Plasma Cell MZB1, SSR4, FKBP11
16 pre-dendritic cell CST3, HLA-DPA1,FCER1A
17 dendritic cell ITM2C, SEC61B,JCHAIN
18 Platelet PF4, PPBP, GP9
19 pro-B cell HMGB1, IGLL1,STMN1
20 Stromal cell AOX1, SEPP1
21 Eosinophil MT-CO2, MT-ND3,MT-CO3,
22 B cell , NK cell (mixture,not sure) CD74,CD79A, NKG7, GZMH
23 pro-B STMN1, TUBA1B

下面進(jìn)行注釋:

adata.obs['louvain'].cat.categories

返回共有24個(gè)cluster:

   Index(['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12',
           '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23'],
          dtype='object')

將編號(hào)所對(duì)應(yīng)的細(xì)胞類型注釋上去:

adata.obs['louvain_anno'] = adata.obs['louvain']

# annote them with names
adata.obs['louvain_anno'].cat.categories = ['0/nT', '1/B', '2nT', '3/CD8+T',
                                            '4/Neu', '5/nT', '6/Neu', '7/NK',
                                            '8/Eos', '9/CD8+T', '10/Eos', '11/pre-B',
                                            '12/NRBC','13/HSC', '14/Mon', '15/plasma', 
                                            '16/preDC', '17/DC', '18/plt', '19/pro-B', 
                                            '20/strom','21/Eos','22/','23/pro-B']

3. 計(jì)算 PAGA并可視化

sc.tl.paga(adata, groups='louvain_anno')
sc.pl.paga(adata, threshold=0.03)

image

于是就得到了這樣像“骨架”一樣的結(jié)果怠晴。這個(gè)圖顯示的就是細(xì)胞與細(xì)胞之間的關(guān)系遥金,距離越近表示關(guān)系越接近。

4. 利用PAGA重新計(jì)算細(xì)胞之間的距離

還記得我們第0步計(jì)算的距離嗎蒜田?現(xiàn)在我們要將細(xì)胞在PAGA這個(gè)“骨架”上重現(xiàn)出來(lái)稿械,就利用PAGA的計(jì)算結(jié)果,把細(xì)胞放上去冲粤。

sc.tl.draw_graph(adata, init_pos='paga')
sc.pl.draw_graph(adata, color=['louvain_anno'], legend_loc='right margin')

image

5.(可選)美化圖片

Choose the colors of the clusters a bit more consistently.

pl.figure(figsize=(8, 2))
for i in range(28):
    pl.scatter(i, 1, c=sc.pl.palettes.zeileis_26[i], s=200)
pl.show()

image
zeileis_colors = np.array(sc.pl.palettes.zeileis_26)
new_colors = np.array(adata.uns['louvain_anno_colors'])

new_colors[[13]] = zeileis_colors[[17]]  # Stem colors / yellow
new_colors[[12]] = zeileis_colors[[5]]  # Ery colors / red
new_colors[[4,6]] = zeileis_colors[[12,12]] # Neutrophil / green
new_colors[[8,10]] = zeileis_colors[[11,11]] # Eosinophil / light red
new_colors[[14]] = zeileis_colors[[19]] # monocyte / light green
new_colors[[16,17]] = zeileis_colors[[18,26]]  # DC / yellow
new_colors[[0,2,3,5,9]] = zeileis_colors[[7,7,6,7,6]]  # naive T & CD8+T / light blue
new_colors[[7]] = zeileis_colors[[0]]  # NK cell / dark blue
new_colors[[1,11,19,23]] = zeileis_colors[[23,22,9,9]]  # B cell / pink

new_colors[[22]] = zeileis_colors[[25]]  # Not known / grey
new_colors[[15]] = zeileis_colors[[3]] #plasma cell / blue grey
new_colors[[18]] = zeileis_colors[[4]] #platelet / pink grey

adata.uns['louvain_anno_colors'] = new_colors

And add some white space to some cluster names.

sc.pl.paga_compare(
    adata, threshold=0.03, title='', right_margin=0.2, size=10, edge_width_scale=0.5,
    legend_fontsize=12, fontsize=12, frameon=False, edges=True, save=True)

--> added 'pos', the PAGA positions (adata.uns['paga'])
saving figure to file ./figures/paga_compare.pdf

image

6. 計(jì)算偽時(shí)間值

首先需要定義一個(gè)偽時(shí)間值為0的細(xì)胞群體美莫,一般來(lái)說(shuō)就是干細(xì)胞或者祖細(xì)胞√莶叮總之就是最原始的細(xì)胞類型厢呵。

# the most primitive cell is refered as 0 persudotime.
# Group 13 is the nearest cell population to Hematopoietic stem cell.
adata.uns['iroot'] = np.flatnonzero(adata.obs['louvain_anno']  == '13/HSC')[0]
sc.tl.dpt(adata)

sc.pl.draw_graph(adata, color=['louvain_anno', 'dpt_pseudotime'],
                 legend_loc='right margin',title = ['','pseudotime'])

image

偽時(shí)間圖的分化軌跡無(wú)法用色彩體現(xiàn)出來(lái),可能是偽時(shí)間參數(shù)的問題傀顾。

上面的圖注被遮住了襟铭,所以單獨(dú)做一個(gè)圖。

#plot again to see full legends info
sc.pl.draw_graph(adata, color=['louvain_anno'],
                 legend_loc='right margin',title = ['']) 

image

保存結(jié)果文件:

adata.write("/data1/zll/project/deepBase3/HCA/bone_marrow/scanpy/3_31_traj/trajectory_3_31.h5ad")

關(guān)于scanpy如何進(jìn)行多樣本整合數(shù)據(jù)分析,還望道友們不吝賜教寒砖!

?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請(qǐng)聯(lián)系作者
  • 序言:七十年代末赐劣,一起剝皮案震驚了整個(gè)濱河市,隨后出現(xiàn)的幾起案子哩都,更是在濱河造成了極大的恐慌隆豹,老刑警劉巖,帶你破解...
    沈念sama閱讀 216,324評(píng)論 6 498
  • 序言:濱河連續(xù)發(fā)生了三起死亡事件茅逮,死亡現(xiàn)場(chǎng)離奇詭異璃赡,居然都是意外死亡,警方通過查閱死者的電腦和手機(jī)献雅,發(fā)現(xiàn)死者居然都...
    沈念sama閱讀 92,356評(píng)論 3 392
  • 文/潘曉璐 我一進(jìn)店門碉考,熙熙樓的掌柜王于貴愁眉苦臉地迎上來(lái),“玉大人挺身,你說(shuō)我怎么就攤上這事侯谁。” “怎么了章钾?”我有些...
    開封第一講書人閱讀 162,328評(píng)論 0 353
  • 文/不壞的土叔 我叫張陵墙贱,是天一觀的道長(zhǎng)。 經(jīng)常有香客問我贱傀,道長(zhǎng)惨撇,這世上最難降的妖魔是什么? 我笑而不...
    開封第一講書人閱讀 58,147評(píng)論 1 292
  • 正文 為了忘掉前任府寒,我火速辦了婚禮魁衙,結(jié)果婚禮上,老公的妹妹穿的比我還像新娘株搔。我一直安慰自己剖淀,他們只是感情好,可當(dāng)我...
    茶點(diǎn)故事閱讀 67,160評(píng)論 6 388
  • 文/花漫 我一把揭開白布纤房。 她就那樣靜靜地躺著纵隔,像睡著了一般。 火紅的嫁衣襯著肌膚如雪炮姨。 梳的紋絲不亂的頭發(fā)上捌刮,一...
    開封第一講書人閱讀 51,115評(píng)論 1 296
  • 那天,我揣著相機(jī)與錄音剑令,去河邊找鬼糊啡。 笑死,一個(gè)胖子當(dāng)著我的面吹牛吁津,可吹牛的內(nèi)容都是我干的棚蓄。 我是一名探鬼主播堕扶,決...
    沈念sama閱讀 40,025評(píng)論 3 417
  • 文/蒼蘭香墨 我猛地睜開眼,長(zhǎng)吁一口氣:“原來(lái)是場(chǎng)噩夢(mèng)啊……” “哼梭依!你這毒婦竟也來(lái)了稍算?” 一聲冷哼從身側(cè)響起,我...
    開封第一講書人閱讀 38,867評(píng)論 0 274
  • 序言:老撾萬(wàn)榮一對(duì)情侶失蹤役拴,失蹤者是張志新(化名)和其女友劉穎糊探,沒想到半個(gè)月后,有當(dāng)?shù)厝嗽跇淞掷锇l(fā)現(xiàn)了一具尸體河闰,經(jīng)...
    沈念sama閱讀 45,307評(píng)論 1 310
  • 正文 獨(dú)居荒郊野嶺守林人離奇死亡科平,尸身上長(zhǎng)有42處帶血的膿包…… 初始之章·張勛 以下內(nèi)容為張勛視角 年9月15日...
    茶點(diǎn)故事閱讀 37,528評(píng)論 2 332
  • 正文 我和宋清朗相戀三年,在試婚紗的時(shí)候發(fā)現(xiàn)自己被綠了姜性。 大學(xué)時(shí)的朋友給我發(fā)了我未婚夫和他白月光在一起吃飯的照片瞪慧。...
    茶點(diǎn)故事閱讀 39,688評(píng)論 1 348
  • 序言:一個(gè)原本活蹦亂跳的男人離奇死亡,死狀恐怖部念,靈堂內(nèi)的尸體忽然破棺而出弃酌,到底是詐尸還是另有隱情,我是刑警寧澤儡炼,帶...
    沈念sama閱讀 35,409評(píng)論 5 343
  • 正文 年R本政府宣布妓湘,位于F島的核電站,受9級(jí)特大地震影響乌询,放射性物質(zhì)發(fā)生泄漏榜贴。R本人自食惡果不足惜,卻給世界環(huán)境...
    茶點(diǎn)故事閱讀 41,001評(píng)論 3 325
  • 文/蒙蒙 一楣责、第九天 我趴在偏房一處隱蔽的房頂上張望竣灌。 院中可真熱鬧,春花似錦秆麸、人聲如沸。這莊子的主人今日做“春日...
    開封第一講書人閱讀 31,657評(píng)論 0 22
  • 文/蒼蘭香墨 我抬頭看了看天上的太陽(yáng)。三九已至坷随,卻和暖如春房铭,著一層夾襖步出監(jiān)牢的瞬間,已是汗流浹背温眉。 一陣腳步聲響...
    開封第一講書人閱讀 32,811評(píng)論 1 268
  • 我被黑心中介騙來(lái)泰國(guó)打工缸匪, 沒想到剛下飛機(jī)就差點(diǎn)兒被人妖公主榨干…… 1. 我叫王不留,地道東北人类溢。 一個(gè)月前我還...
    沈念sama閱讀 47,685評(píng)論 2 368
  • 正文 我出身青樓凌蔬,卻偏偏與公主長(zhǎng)得像露懒,于是被迫代替她去往敵國(guó)和親。 傳聞我的和親對(duì)象是個(gè)殘疾皇子砂心,可洞房花燭夜當(dāng)晚...
    茶點(diǎn)故事閱讀 44,573評(píng)論 2 353

推薦閱讀更多精彩內(nèi)容