單細(xì)胞轉(zhuǎn)錄組數(shù)據(jù)分析|| scanpy教程：使用ingest和BBKNN整合多樣本

隨著單細(xì)胞技術(shù)的成熟，測序成本的降低恰起，單細(xì)胞的數(shù)據(jù)量和樣本量也日益增長状囱。我們知道單細(xì)胞轉(zhuǎn)錄組的一個(gè)主要應(yīng)用就是解釋細(xì)胞的異質(zhì)性仁连，那么蓝丙，不同器官邪铲，不同測序平臺(tái)硫豆，不同物種之間的單細(xì)胞數(shù)據(jù)何如整合分析呢龙巨？特別是在單細(xì)胞的數(shù)據(jù)維度這么高的前提下，顯然傳統(tǒng)的基于回歸的方法已經(jīng)不適用了熊响。于是出現(xiàn)了一批單細(xì)胞整合分析的工具旨别，它們大多數(shù)是在R生態(tài)條件下的。如：

在我們理解單細(xì)胞數(shù)據(jù)的時(shí)候一張cell X gene 的大表不能離開我們的腦海。

adata.to_df()
Out[24]: 
                    RP11-34P13.3  FAM138A  ...  AC213203.1  FAM231B
AAACCCAAGCGTATGG-1           0.0      0.0  ...         0.0      0.0
AAACCCAGTCCTACAA-1           0.0      0.0  ...         0.0      0.0
AAACCCATCACCTCAC-1           0.0      0.0  ...         0.0      0.0
AAACGCTAGGGCATGT-1           0.0      0.0  ...         0.0      0.0
AAACGCTGTAGGTACG-1           0.0      0.0  ...         0.0      0.0
                         ...      ...  ...         ...      ...
TTTGTTGCAGGTACGA-1           0.0      0.0  ...         0.0      0.0
TTTGTTGCAGTCTCTC-1           0.0      0.0  ...         0.0      0.0
TTTGTTGGTAATTAGG-1           0.0      0.0  ...         0.0      0.0
TTTGTTGTCCTTGGAA-1           0.0      0.0  ...         0.0      0.0
TTTGTTGTCGCACGAC-1           0.0      0.0  ...         0.0      0.0

當(dāng)我們有多個(gè)樣本的時(shí)候就是有多張這樣的表，那讓我們自己手動(dòng)來整合這兩張表的話占拍，我們會(huì)怎么做呢权逗？

肯定是行列分別對齊把它們拼在一起啊苇羡，就像拼積木一樣的，但是這樣的結(jié)果就是：

兩個(gè)樣本在圖譜上完全的分開來了。我們不同平臺(tái)的樣本，相同的細(xì)胞類型應(yīng)該是在一起的啊绞铃。于是我們開始思考如何完成這樣的整合。

seurat提供了一套解決方案嫂侍，就是在數(shù)據(jù)集中構(gòu)建錨點(diǎn)儿捧，將不同數(shù)據(jù)集中相似的細(xì)胞錨在一起。

那么如何錨挑宠，選擇哪些特征來錨定菲盾，又開發(fā)出不同的算法。不管算法如何各淀，首先我們看看這種錨定可以為我們帶來什么懒鉴？相同的細(xì)胞類型mapping在一起，一個(gè)自然的作用就是用來mapping細(xì)胞類型未知的數(shù)據(jù)揪阿。

所以在scanpy中也如seurat一樣在多樣本分析中疗我，分別給出reference的方法和整合的方法。目前在scanpy中分別是ingest和BBKNN（Batch balanced kNN）南捂，當(dāng)然整合也是可以用來做reference的。scanpy.external.pp.mnn_correct應(yīng)該也是可以用的旧找。

先來看ingest溺健，通過投射到參考數(shù)據(jù)上的PCA(或備用模型)上，將一個(gè)adata的嵌入和注釋與一個(gè)參考數(shù)據(jù)集adata_ref集成在一起。該函數(shù)使用knn分類器來映射標(biāo)簽鞭缭，使用UMAP來映射嵌入剖膳。

再來看看bbknn是一個(gè)快速和直觀的批處理效果去除工具，可以直接在scanpy工作流中使用岭辣。它是scanpy.api.pp.neighbors()的替代方法吱晒，這兩個(gè)函數(shù)都創(chuàng)建了一個(gè)鄰居圖，以便后續(xù)在集群沦童、偽時(shí)間和UMAP可視化中使用仑濒。標(biāo)準(zhǔn)方法首先確定整個(gè)數(shù)據(jù)結(jié)構(gòu)中每個(gè)單元的k個(gè)最近鄰，然后將候選單元轉(zhuǎn)換為指數(shù)相關(guān)的連接偷遗，然后作為進(jìn)一步分析的基礎(chǔ)墩瞳。

那么我們就來看一下在scanpy的實(shí)現(xiàn)吧。

import scanpy as sc
import pandas as pd
import seaborn as sns
import sklearn
import sys
import scipy
import bbknn

sc.settings.verbosity = 1             # verbosity: errors (0), warnings (1), info (2), hints (3)
sc.logging.print_versions()
sc.settings.set_figure_params(dpi=80, frameon=False, figsize=(3, 3))

scanpy==1.4.5.1 anndata==0.7.1 umap==0.3.10 numpy==1.16.5 scipy==1.3.1 pandas==0.25.1 scikit-learn==0.21.3 statsmodels==0.10.1 python-igraph==0.8.0

ingest 注釋

adata_ref = sc.datasets.pbmc3k_processed()  # this is an earlier version of the dataset from the pbmc3k tutorial

adata_ref
AnnData object with n_obs × n_vars = 2638 × 1838 
    obs: 'n_genes', 'percent_mito', 'n_counts', 'louvain'
    var: 'n_cells'
    uns: 'draw_graph', 'louvain', 'louvain_colors', 'neighbors', 'pca', 'rank_genes_groups'
    obsm: 'X_pca', 'X_tsne', 'X_umap', 'X_draw_graph_fr'
    varm: 'PCs'

我們一次看看以下參考數(shù)據(jù)集都有哪些內(nèi)容：

adata_ref.obs
Out[9]: 
                  n_genes  percent_mito  n_counts          louvain
index                                                             
AAACATACAACCAC-1      781      0.030178    2419.0      CD4 T cells
AAACATTGAGCTAC-1     1352      0.037936    4903.0          B cells
AAACATTGATCAGC-1     1131      0.008897    3147.0      CD4 T cells
AAACCGTGCTTCCG-1      960      0.017431    2639.0  CD14+ Monocytes
AAACCGTGTATGCG-1      522      0.012245     980.0         NK cells
                  ...           ...       ...              ...
TTTCGAACTCTCAT-1     1155      0.021104    3459.0  CD14+ Monocytes
TTTCTACTGAGGCA-1     1227      0.009294    3443.0          B cells
TTTCTACTTCCTCG-1      622      0.021971    1684.0          B cells
TTTGCATGAGAGGC-1      454      0.020548    1022.0          B cells
TTTGCATGCCTCAC-1      724      0.008065    1984.0      CD4 T cells

[2638 rows x 4 columns]

adata_ref.var
Out[10]: 
         n_cells
index           
TNFRSF4      155
CPSF3L       202
ATAD3C         9
C1orf86      501
RER1         608
         ...
ICOSLG        34
SUMO3        570
SLC19A1       31
S100B         94
PRMT2        588

[1838 rows x 1 columns]

adata_ref.uns['louvain_colors']
Out[14]: 
array(['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728', '#9467bd', '#8c564b',
       '#e377c2', '#bcbd22'], dtype='<U7')

 adata_ref.obsm
Out[16]: AxisArrays with keys: X_pca, X_tsne, X_umap, X_draw_graph_fr

adata_ref.obsm['X_umap']
Out[17]: 
array([[ 1.35285574,  2.26612719],
       [-0.47802448,  7.87730423],
       [ 2.16588875, -0.24481226],
       ...,
       [ 0.34670979,  8.34967798],
       [ 0.19864146,  9.56698797],
       [ 2.62803322,  0.36722543]])

有沒有再次理解AnnData 這個(gè)對象的數(shù)據(jù)結(jié)構(gòu)呢氏豌？

可以看到在這個(gè)數(shù)據(jù)集中降維聚類都是做過的喉酌，我們可以畫個(gè)圖看看：

sc.pl.umap(adata_ref, color='louvain')

接下來我們看看要預(yù)測的數(shù)據(jù)集是怎樣的。

adata = sc.datasets.pbmc68k_reduced()
adata

AnnData object with n_obs × n_vars = 700 × 765 
    obs: 'bulk_labels', 'n_genes', 'percent_mito', 'n_counts', 'S_score', 'G2M_score', 'phase', 'louvain'
    var: 'n_counts', 'means', 'dispersions', 'dispersions_norm', 'highly_variable'
    uns: 'bulk_labels_colors', 'louvain', 'louvain_colors', 'neighbors', 'pca', 'rank_genes_groups'
    obsm: 'X_pca', 'X_umap'
    varm: 'PCs'

可見它也是降維聚類過的了泵喘。

sc.pl.umap(adata, color='louvain')

這個(gè)數(shù)據(jù)集并沒有得到細(xì)胞類型的定義泪电。

構(gòu)建注釋數(shù)據(jù)結(jié)構(gòu)：

var_names = adata_ref.var_names.intersection(adata.var_names) # 取交集
adata_ref = adata_ref[:, var_names]
adata = adata[:, var_names]

sc.pp.pca(adata_ref)
sc.pp.neighbors(adata_ref)
sc.tl.umap(adata_ref)
sc.tl.leiden(adata_ref)# 新的聚類方法
sc.pl.umap(adata_ref, color=['louvain','leiden'])
adata_ref

AnnData object with n_obs × n_vars = 2638 × 208 
    obs: 'n_genes', 'percent_mito', 'n_counts', 'louvain', 'leiden'
    var: 'n_cells'
    uns: 'draw_graph', 'louvain', 'louvain_colors', 'neighbors', 'pca', 'rank_genes_groups', 'umap', 'leiden', 'leiden_colors'
    obsm: 'X_pca', 'X_tsne', 'X_umap', 'X_draw_graph_fr'
    varm: 'PCs'

sc.pp.pca(adata)
sc.pp.neighbors(adata)
sc.tl.umap(adata)
sc.tl.leiden(adata)
sc.pl.umap(adata, color=['louvain','leiden'])
adata

AnnData object with n_obs × n_vars = 700 × 208 
    obs: 'bulk_labels', 'n_genes', 'percent_mito', 'n_counts', 'S_score', 'G2M_score', 'phase', 'louvain', 'leiden'
    var: 'n_counts', 'means', 'dispersions', 'dispersions_norm', 'highly_variable'
    uns: 'bulk_labels_colors', 'louvain', 'louvain_colors', 'neighbors', 'pca', 'rank_genes_groups', 'umap', 'leiden', 'leiden_colors'
    obsm: 'X_pca', 'X_umap'
    varm: 'PCs'

用ingest來做細(xì)胞注釋吧。

sc.tl.ingest(adata, adata_ref, obs='louvain')
adata.uns['louvain_colors'] = adata_ref.uns['louvain_colors']  # fix colors

我們來看看sc.tl.ingest的幫助文檔：

Help on function ingest in module scanpy.tools._ingest:

ingest(adata: anndata._core.anndata.AnnData, adata_ref: anndata._core.anndata.AnnData, obs: Union[str, Iterable[str], NoneType] = None, embedding_method: Union[str, Iterable[str]] = ('umap', 'pca'), labeling_method: str = 'knn', inplace: bool = True, **kwargs)
    Map labels and embeddings from reference data to new data.
    
    :tutorial:`integrating-data-using-ingest`
    
    Integrates embeddings and annotations of an `adata` with a reference dataset
    `adata_ref` through projecting on a PCA (or alternate
    model) that has been fitted on the reference data. The function uses a knn
    classifier for mapping labels and the UMAP package [McInnes18]_ for mapping
    the embeddings.
    
    .. note::
    
        We refer to this *asymmetric* dataset integration as *ingesting*
        annotations from reference data to new data. This is different from
        learning a joint representation that integrates both datasets in an
        unbiased way, as CCA (e.g. in Seurat) or a conditional VAE (e.g. in
        scVI) would do.
    
    You need to run :func:`~scanpy.pp.neighbors` on `adata_ref` before
    passing it.
    
    Parameters
    ----------
    adata
        The annotated data matrix of shape `n_obs` × `n_vars`. Rows correspond
        to cells and columns to genes. This is the dataset without labels and
        embeddings.
    adata_ref
        The annotated data matrix of shape `n_obs` × `n_vars`. Rows correspond
        to cells and columns to genes.
        Variables (`n_vars` and `var_names`) of `adata_ref` should be the same
        as in `adata`.
        This is the dataset with labels and embeddings
        which need to be mapped to `adata`.
    obs
        Labels' keys in `adata_ref.obs` which need to be mapped to `adata.obs`
        (inferred for observation of `adata`).
    embedding_method
        Embeddings in `adata_ref` which need to be mapped to `adata`.
        The only supported values are 'umap' and 'pca'.
    labeling_method
        The method to map labels in `adata_ref.obs` to `adata.obs`.
        The only supported value is 'knn'.
    inplace
        Only works if `return_joint=False`.
        Add labels and embeddings to the passed `adata` (if `True`)
        or return a copy of `adata` with mapped embeddings and labels.
    
    Returns
    -------
    * if `inplace=False` returns a copy of `adata`
      with mapped embeddings and labels in `obsm` and `obs` correspondingly
    * if `inplace=True` returns `None` and updates `adata.obsm` and `adata.obs`
      with mapped embeddings and labels
    
    Example
    -------
    Call sequence:
    
import scanpy as sc
sc.pp.neighbors(adata_ref)
sc.tl.umap(adata_ref)
sc.tl.ingest(adata, adata_ref, obs='cell_type')
    
    .. _ingest PBMC tutorial: https://scanpy-tutorials.readthedocs.io/en/latest/integrating-pbmcs-using-ingest.html
    .. _ingest Pancreas tutorial: https://scanpy-tutorials.readthedocs.io/en/latest/integrating-pancreas-using-ingest.html

通過比較‘bulk_label’注釋和‘louvain’注釋纪铺，我們發(fā)現(xiàn)數(shù)據(jù)被合理地映射相速，只有樹突細(xì)胞的注釋似乎是含糊不清的，在adata中可能已經(jīng)是模糊的了霹陡。我們來對adata做進(jìn)一步的處理和蚪。

adata_concat = adata_ref.concatenate(adata, batch_categories=['ref', 'new'])
adata_concat.obs.louvain  = adata_concat.obs.louvain.astype('category')
adata_concat.obs.louvain.cat.reorder_categories(adata_ref.obs.louvain.cat.categories, inplace=True)  # fix category ordering
adata_concat.uns['louvain_colors'] = adata_ref.uns['louvain_colors']  # fix category colors
adata_concat
sc.pl.umap(adata_concat, color=['batch', 'louvain'])


AnnData object with n_obs × n_vars = 3338 × 208 
    obs: 'G2M_score', 'S_score', 'batch', 'bulk_labels', 'leiden', 'louvain', 'n_counts', 'n_genes', 'percent_mito', 'phase'
    var: 'n_cells-ref', 'n_counts-new', 'means-new', 'dispersions-new', 'dispersions_norm-new', 'highly_variable-new'
    obsm: 'X_pca', 'X_umap'

雖然在單核細(xì)胞和樹突狀細(xì)胞簇中似乎存在一些批處理效應(yīng)，但在其他方面烹棉，新數(shù)據(jù)被繪制得相對均勻攒霹。
巨核細(xì)胞只存在于adata_ref中，沒有來自adata映射的單元格浆洗。如果交換參考數(shù)據(jù)和查詢數(shù)據(jù)催束，巨核細(xì)胞不再作為單獨(dú)的集群出現(xiàn)。這是一個(gè)極端的情況伏社，因?yàn)閰⒖紨?shù)據(jù)非常小;但是抠刺，人們應(yīng)該始終質(zhì)疑參考數(shù)據(jù)是否包含足夠的生物變異，以便有意義地容納查詢數(shù)據(jù)摘昌。

使用BBKNN整合

sc.tl.pca(adata_concat)
sc.external.pp.bbknn(adata_concat, batch_key='batch')  # running bbknn 1.3.6
sc.tl.umap(adata_concat)
sc.pl.umap(adata_concat, color=['batch', 'louvain'])

adata_concat
Out[45]: 
AnnData object with n_obs × n_vars = 3338 × 208 
    obs: 'G2M_score', 'S_score', 'batch', 'bulk_labels', 'leiden', 'louvain', 'n_counts', 'n_genes', 'percent_mito', 'phase'
    var: 'n_cells-ref', 'n_counts-new', 'means-new', 'dispersions-new', 'dispersions_norm-new', 'highly_variable-new'
    uns: 'batch_colors', 'louvain_colors', 'pca', 'neighbors', 'umap'
    obsm: 'X_pca', 'X_umap'
    varm: 'PCs'

BBKNN并不維持巨核細(xì)胞簇速妖。然而，它似乎更均勻地混合細(xì)胞聪黎。

一個(gè)例子使用BBKNN整合數(shù)據(jù)的例子

以下數(shù)據(jù)已在scGen論文[Lotfollahi19]中使用罕容。點(diǎn)擊pancreas下載數(shù)據(jù)。

它包含了來自4個(gè)不同研究(Segerstolpe16, Baron16, Wang16, Muraro16)的人類胰腺數(shù)據(jù)，這些數(shù)據(jù)在單細(xì)胞數(shù)據(jù)集集成的開創(chuàng)性論文(Butler18, Haghverdi18)中被使用過锦秒，并在此后多次被使用露泊。

h5ad = 'E:\\learnscanpy\\data\\objects-pancreas\\pancreas.h5ad'
adata_all = sc.read_h5ad(h5ad)
adata_all

AnnData object with n_obs × n_vars = 14693 × 2448 
    obs: 'celltype', 'sample', 'n_genes', 'batch', 'n_counts', 'louvain'
    var: 'n_cells-0', 'n_cells-1', 'n_cells-2', 'n_cells-3'
    uns: 'celltype_colors', 'louvain', 'neighbors', 'pca', 'sample_colors'
    obsm: 'X_pca', 'X_umap'
    varm: 'PCs'

 counts = adata_all.obs.celltype.value_counts()
counts
Out[173]: 
alpha                     4214
beta                      3354
ductal                    1804
acinar                    1368
not applicable            1154
delta                      917
gamma                      571
endothelial                289
activated_stellate         284
dropped                    178
quiescent_stellate         173
mesenchymal                 80
macrophage                  55
PSC                         54
unclassified endocrine      41
co-expression               39
mast                        32
epsilon                     28
mesenchyme                  27
schwann                     13
t_cell                       7
MHC class II                 5
unclear                      4
unclassified                 2
Name: celltype, dtype: int64

adata_all.obs
Out[171]: 
                              celltype sample  ...      n_counts louvain
index                                          ...                      
human1_lib1.final_cell_0001-0   acinar  Baron  ...  2.241100e+04       2
human1_lib1.final_cell_0002-0   acinar  Baron  ...  2.794900e+04       2
human1_lib1.final_cell_0003-0   acinar  Baron  ...  1.689200e+04       2
human1_lib1.final_cell_0004-0   acinar  Baron  ...  1.929900e+04       2
human1_lib1.final_cell_0005-0   acinar  Baron  ...  1.506700e+04       2
                               ...    ...  ...           ...     ...
reads.29499-3                   ductal   Wang  ...  1.056558e+06      10
reads.29500-3                   ductal   Wang  ...  9.926309e+05      10
reads.29501-3                     beta   Wang  ...  1.751338e+06      10
reads.29502-3                  dropped   Wang  ...  2.163764e+06      10
reads.29503-3                     beta   Wang  ...  2.038979e+06      10

[14693 rows x 6 columns]

可以看出這個(gè)數(shù)據(jù)集已經(jīng)降維聚類好了，所以我們可以可視化一下：

sc.pl.umap(adata_all,color=['sample', 'celltype','louvain'])

樣本之間的批次很嚴(yán)重啊旅择。

去掉細(xì)胞數(shù)較小的小群惭笑，

minority_classes = counts.index[-5:].tolist()        # get the minority classes

# ['schwann', 't_cell', 'MHC class II', 'unclear', 'unclassified']

adata_all = adata_all[                               # actually subset
    ~adata_all.obs.celltype.isin(minority_classes)]
adata_all.obs.celltype.cat.reorder_categories(       # reorder according to abundance
    counts.index[:-5].tolist(), inplace=True)

adata_all.obs.celltype.value_counts()
Out[182]: 
alpha                     4214
beta                      3354
ductal                    1804
acinar                    1368
not applicable            1154
delta                      917
gamma                      571
endothelial                289
activated_stellate         284
dropped                    178
quiescent_stellate         173
mesenchymal                 80
macrophage                  55
PSC                         54
unclassified endocrine      41
co-expression               39
mast                        32
epsilon                     28
mesenchyme                  27

進(jìn)行pca降維和umap降維：

sc.pp.pca(adata_all)
sc.pp.neighbors(adata_all)
sc.tl.umap(adata_all)
sc.pl.umap(adata_all, color=['batch', 'celltype'], palette=sc.pl.palettes.vega_20_scanpy)

下面我們使用BBKNN來整合數(shù)據(jù)：

sc.external.pp.bbknn(adata_all, batch_key='batch')
sc.tl.umap(adata_all)
adata_all
sc.pl.umap(adata_all, color=['sample','batch', 'celltype'])

果然要比原始的數(shù)據(jù)好多了。但是改變的是什么生真？

AnnData object with n_obs × n_vars = 14662 × 2448 
    obs: 'celltype', 'sample', 'n_genes', 'batch', 'n_counts', 'louvain'
    var: 'n_cells-0', 'n_cells-1', 'n_cells-2', 'n_cells-3'
    uns: 'celltype_colors', 'louvain', 'neighbors', 'pca', 'sample_colors', 'louvain_colors', 'umap', 'batch_colors'
    obsm: 'X_pca', 'X_umap'
    varm: 'PCs'

如果想對其中某個(gè)樣本進(jìn)行單獨(dú)的注釋沉噩，可以用上面提到的ingest。選擇一個(gè)參考批次來訓(xùn)練模型和建立鄰域圖(這里是一個(gè)PCA)汇歹，并分離出所有其他批次屁擅。

adata_ref = adata_all[adata_all.obs.batch == '0']
adata_ref
Out[191]: 
View of AnnData object with n_obs × n_vars = 8549 × 2448 
    obs: 'celltype', 'sample', 'n_genes', 'batch', 'n_counts', 'louvain'
    var: 'n_cells-0', 'n_cells-1', 'n_cells-2', 'n_cells-3'
    uns: 'celltype_colors', 'louvain', 'neighbors', 'pca', 'sample_colors', 'louvain_colors', 'umap', 'batch_colors'
    obsm: 'X_pca', 'X_umap'
    varm: 'PCs'

sc.pp.pca(adata_ref)
sc.pp.neighbors(adata_ref)
sc.tl.umap(adata_ref)

adata_ref
Out[197]: 
AnnData object with n_obs × n_vars = 8549 × 2448 
    obs: 'celltype', 'sample', 'n_genes', 'batch', 'n_counts', 'louvain'
    var: 'n_cells-0', 'n_cells-1', 'n_cells-2', 'n_cells-3'
    uns: 'celltype_colors', 'louvain', 'neighbors', 'pca', 'sample_colors', 'louvain_colors', 'umap', 'batch_colors'
    obsm: 'X_pca', 'X_umap'
    varm: 'PCs'

sc.pl.umap(adata_ref, color='celltype')

選取數(shù)據(jù)集用ingest于adata_ref進(jìn)行mapping：

adatas = [adata_all[adata_all.obs.batch == i].copy() for i in ['1', '2', '3']]
sc.settings.verbosity = 2  # a bit more logging
for iadata, adata in enumerate(adatas):
    print(f'... integrating batch {iadata+1}')
    adata.obs['celltype_orig'] = adata.obs.celltype  # save the original cell type
    sc.tl.ingest(adata, adata_ref, obs='celltype')

integrating batch 1
running ingest
    finished (0:00:08)
integrating batch 2
running ingest
    finished (0:00:06)
integrating batch 3
running ingest
    finished (0:00:03)

adata_concat = adata_ref.concatenate(adatas)
adata_concat

Out[200]: 
AnnData object with n_obs × n_vars = 14662 × 2448 
    obs: 'batch', 'celltype', 'celltype_orig', 'louvain', 'n_counts', 'n_genes', 'sample'
    var: 'n_cells-0', 'n_cells-1', 'n_cells-2', 'n_cells-3'
    obsm: 'X_pca', 'X_umap'

adata_concat.obs.celltype = adata_concat.obs.celltype.astype('category')
adata_concat.obs.celltype.cat.reorder_categories(adata_ref.obs.celltype.cat.categories, inplace=True)  # fix category ordering
adata_concat.uns['celltype_colors'] = adata_ref.uns['celltype_colors']  # fix category coloring

sc.pl.umap(adata_concat, color=['celltype_orig','batch', 'celltype'])

與BBKNN的結(jié)果相比，這是以一種更加明顯的方式保持分群产弹。如果已經(jīng)觀察到一個(gè)想要的連續(xù)結(jié)構(gòu)(例如在造血數(shù)據(jù)集中)派歌，攝取允許容易地維持這個(gè)結(jié)構(gòu)。

一致性評(píng)估

adata_query = adata_concat[adata_concat.obs.batch.isin(['1', '2', '3'])]

View of AnnData object with n_obs × n_vars = 6113 × 2448 
    obs: 'batch', 'celltype', 'celltype_orig', 'louvain', 'n_counts', 'n_genes', 'sample'
    var: 'n_cells-0', 'n_cells-1', 'n_cells-2', 'n_cells-3'
    uns: 'celltype_colors', 'celltype_orig_colors', 'batch_colors'
    obsm: 'X_pca', 'X_umap'

sc.pl.umap(
    adata_query, color=['batch', 'celltype', 'celltype_orig'], wspace=0.4)

這個(gè)結(jié)果依然不能很好的反映一致性痰哨，讓我們首先關(guān)注與參考保守的細(xì)胞類型胶果，以簡化混淆矩陣的reads。

obs_query = adata_query.obs
conserved_categories = obs_query.celltype.cat.categories.intersection(obs_query.celltype_orig.cat.categories)  # intersected categories
obs_query_conserved = obs_query.loc[obs_query.celltype.isin(conserved_categories) & obs_query.celltype_orig.isin(conserved_categories)]  # intersect categories
obs_query_conserved.celltype.cat.remove_unused_categories(inplace=True)  # remove unused categoriyes
obs_query_conserved.celltype_orig.cat.remove_unused_categories(inplace=True)  # remove unused categoriyes
obs_query_conserved.celltype_orig.cat.reorder_categories(obs_query_conserved.celltype.cat.categories, inplace=True)  # fix category ordering

obs_query_conserved
Out[214]: 
                batch     celltype celltype_orig  ...      n_counts  n_genes  sample
D28.1_1-1-1         1        alpha         alpha  ...  2.322583e+04     5448  Muraro
D28.1_13-1-1        1       ductal        ductal  ...  2.334263e+04     5911  Muraro
D28.1_15-1-1        1        alpha         alpha  ...  2.713471e+04     5918  Muraro
D28.1_17-1-1        1        alpha         alpha  ...  1.581207e+04     4522  Muraro
D28.1_2-1-1         1  endothelial   endothelial  ...  3.173151e+04     6464  Muraro
              ...          ...           ...  ...           ...      ...     ...
reads.29498-3-3     3       ductal        ductal  ...  1.362606e+06    19950    Wang
reads.29499-3-3     3       ductal        ductal  ...  1.056558e+06    19950    Wang
reads.29500-3-3     3       ductal        ductal  ...  9.926309e+05    19950    Wang
reads.29501-3-3     3         beta          beta  ...  1.751338e+06    19950    Wang
reads.29503-3-3     3         beta          beta  ...  2.038979e+06    19950    Wang

pd.crosstab(obs_query_conserved.celltype, obs_query_conserved.celltype_orig)
Out[215]: 
celltype_orig  alpha  beta  ductal  acinar  delta  gamma  endothelial  mast
celltype                                                                   
alpha           1819     3       7       0      1     20            0     5
beta              49   803       3       1     10     26            0     0
ductal             8     5     693     263      0      0            0     0
acinar             1     3       2     145      0      3            0     0
delta              5     4       0       0    305     73            0     0
gamma              1     5       0       0      0    194            0     0
endothelial        2     0       0       0      0      0           36     0
mast               0     0       1       0      0      0            0     2

總的來說斤斧，保守的細(xì)胞類型也如預(yù)期的那樣被映射早抠。主要的例外是原始注釋中出現(xiàn)的一些腺泡細(xì)胞。然而撬讽，已經(jīng)觀察到參考數(shù)據(jù)同時(shí)具有腺泡和導(dǎo)管細(xì)胞蕊连，這就解釋了差異，并指出了初始注釋中潛在的不一致性游昼。

現(xiàn)在讓我們繼續(xù)看看所有的細(xì)胞類型甘苍。

 pd.crosstab(adata_query.obs.celltype, adata_query.obs.celltype_orig)
Out[216]: 
celltype_orig       PSC  acinar  ...  not applicable  unclassified endocrine
celltype                         ...                                        
alpha                 0       0  ...             304                      11
beta                  0       1  ...             522                      24
ductal                0     263  ...             106                       1
acinar                0     145  ...              86                       0
delta                 0       0  ...              95                       5
gamma                 0       0  ...              14                       0
endothelial           1       0  ...               7                       0
activated_stellate   49       1  ...              17                       0
quiescent_stellate    4       0  ...               1                       0
macrophage            0       0  ...               1                       0
mast                  0       0  ...               1                       0

[11 rows x 16 columns]

我們觀察到PSC(胰腺星狀細(xì)胞)細(xì)胞實(shí)際上只是不一致地注釋并正確地映射到“激活的星狀細(xì)胞”上。
此外烘豌，很高興看到“間充質(zhì)”和“間充質(zhì)”細(xì)胞都屬于同一類別载庭。但是，這個(gè)類別又是“activated_stellate”廊佩，而且可能是錯(cuò)誤的囚聚。這就是我們說的，算法只能接近真相标锄，而不能定義真相顽铸。

可視化分布的批次

通常，批量對應(yīng)的是想要比較的實(shí)驗(yàn)料皇。Scanpy提供了方便的可視化可能性跋破，主要有

a density plot
a partial visualization of a subset of categories/groups in an emnbedding

sc.tl.embedding_density(adata_concat, groupby='batch')
sc.pl.embedding_density(adata_concat, groupby='batch')

for batch in ['1', '2', '3']:
    sc.pl.umap(adata_concat, color='batch', groups=[batch])

BBKNN: fast batch alignment of single cell transcriptomes
integrating-data-using-ingest

?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請聯(lián)系作者

人面猴
序言：七十年代末簸淀，一起剝皮案震驚了整個(gè)濱河市瓶蝴，隨后出現(xiàn)的幾起案子毒返，更是在濱河造成了極大的恐慌，老刑警劉巖舷手，帶你破解...
沈念sama閱讀 206,013評(píng)論 6贊 481
死咒
序言：濱河連續(xù)發(fā)生了三起死亡事件拧簸，死亡現(xiàn)場離奇詭異，居然都是意外死亡男窟，警方通過查閱死者的電腦和手機(jī)盆赤，發(fā)現(xiàn)死者居然都...
沈念sama閱讀 88,205評(píng)論 2贊 382
救了他兩次的神仙讓他今天三更去死
文/潘曉璐我一進(jìn)店門，熙熙樓的掌柜王于貴愁眉苦臉地迎上來歉眷，“玉大人牺六，你說我怎么就攤上這事『辜瘢” “怎么了淑际？”我有些...
開封第一講書人閱讀 152,370評(píng)論 0贊 342
道士緝兇錄：失蹤的賣姜人
文/不壞的土叔我叫張陵，是天一觀的道長扇住。經(jīng)常有香客問我春缕，道長，這世上最難降的妖魔是什么艘蹋？我笑而不...
開封第一講書人閱讀 55,168評(píng)論 1贊 278
?港島之戀（遺憾婚禮）
正文為了忘掉前任锄贼，我火速辦了婚禮，結(jié)果婚禮上女阀，老公的妹妹穿的比我還像新娘宅荤。我一直安慰自己，他們只是感情好浸策，可當(dāng)我...
茶點(diǎn)故事閱讀 64,153評(píng)論 5贊 371
惡毒庶女頂嫁案：這布局不是一般人想出來的
文/花漫我一把揭開白布冯键。她就那樣靜靜地躺著，像睡著了一般的榛。火紅的嫁衣襯著肌膚如雪琼了。梳的紋絲不亂的頭發(fā)上，一...
開封第一講書人閱讀 48,954評(píng)論 1贊 283
城市分裂傳說
那天夫晌，我揣著相機(jī)與錄音雕薪，去河邊找鬼。笑死晓淀，一個(gè)胖子當(dāng)著我的面吹牛所袁，可吹牛的內(nèi)容都是我干的。我是一名探鬼主播凶掰，決...
沈念sama閱讀 38,271評(píng)論 3贊 399
雙鴛鴦連環(huán)套：你想象不到人心有多黑
文/蒼蘭香墨我猛地睜開眼燥爷，長吁一口氣：“原來是場噩夢啊……” “哼蜈亩！你這毒婦竟也來了？” 一聲冷哼從身側(cè)響起前翎，我...
開封第一講書人閱讀 36,916評(píng)論 0贊 259
萬榮殺人案實(shí)錄
序言：老撾萬榮一對情侶失蹤稚配，失蹤者是張志新（化名）和其女友劉穎，沒想到半個(gè)月后港华，有當(dāng)?shù)厝嗽跇淞掷锇l(fā)現(xiàn)了一具尸體道川，經(jīng)...
沈念sama閱讀 43,382評(píng)論 1贊 300
?護(hù)林員之死
正文獨(dú)居荒郊野嶺守林人離奇死亡，尸身上長有42處帶血的膿包…… 初始之章·張勛以下內(nèi)容為張勛視角年9月15日...
茶點(diǎn)故事閱讀 35,877評(píng)論 2贊 323
?白月光啟示錄
正文我和宋清朗相戀三年立宜，在試婚紗的時(shí)候發(fā)現(xiàn)自己被綠了冒萄。大學(xué)時(shí)的朋友給我發(fā)了我未婚夫和他白月光在一起吃飯的照片。...
茶點(diǎn)故事閱讀 37,989評(píng)論 1贊 333
活死人
序言：一個(gè)原本活蹦亂跳的男人離奇死亡橙数，死狀恐怖尊流，靈堂內(nèi)的尸體忽然破棺而出，到底是詐尸還是另有隱情灯帮，我是刑警寧澤崖技，帶...
沈念sama閱讀 33,624評(píng)論 4贊 322
?日本核電站爆炸內(nèi)幕
正文年R本政府宣布，位于F島的核電站施流，受9級(jí)特大地震影響响疚，放射性物質(zhì)發(fā)生泄漏。R本人自食惡果不足惜瞪醋，卻給世界環(huán)境...
茶點(diǎn)故事閱讀 39,209評(píng)論 3贊 307
男人毒藥：我在死后第九天來索命
文/蒙蒙一忿晕、第九天我趴在偏房一處隱蔽的房頂上張望。院中可真熱鬧银受，春花似錦践盼、人聲如沸。這莊子的主人今日做“春日...
開封第一講書人閱讀 30,199評(píng)論 0贊 19
一樁弒父案咕幻，背后竟有這般陰謀
文/蒼蘭香墨我抬頭看了看天上的太陽。三九已至顶霞，卻和暖如春肄程，著一層夾襖步出監(jiān)牢的瞬間，已是汗流浹背选浑。一陣腳步聲響...
開封第一講書人閱讀 31,418評(píng)論 1贊 260
情欲美人皮
我被黑心中介騙來泰國打工蓝厌，沒想到剛下飛機(jī)就差點(diǎn)兒被人妖公主榨干…… 1. 我叫王不留，地道東北人古徒。一個(gè)月前我還...
沈念sama閱讀 45,401評(píng)論 2贊 352
代替公主和親
正文我出身青樓拓提，卻偏偏與公主長得像，于是被迫代替她去往敵國和親隧膘。傳聞我的和親對象是個(gè)殘疾皇子代态，可洞房花燭夜當(dāng)晚...
茶點(diǎn)故事閱讀 42,700評(píng)論 2贊 345