- 單細(xì)胞轉(zhuǎn)錄組數(shù)據(jù)分析|| scanpy教程:預(yù)處理與聚類
- 單細(xì)胞轉(zhuǎn)錄組數(shù)據(jù)分析|| scanpy教程:PAGA軌跡推斷
隨著單細(xì)胞技術(shù)的成熟,測序成本的降低恰起,單細(xì)胞的數(shù)據(jù)量和樣本量也日益增長状囱。我們知道單細(xì)胞轉(zhuǎn)錄組的一個(gè)主要應(yīng)用就是解釋細(xì)胞的異質(zhì)性仁连,那么蓝丙,不同器官邪铲,不同測序平臺(tái)硫豆,不同物種之間的單細(xì)胞數(shù)據(jù)何如整合分析呢龙巨?特別是在單細(xì)胞的數(shù)據(jù)維度這么高的前提下,顯然傳統(tǒng)的基于回歸的方法已經(jīng)不適用了熊响。于是出現(xiàn)了一批單細(xì)胞整合分析的工具旨别,它們大多數(shù)是在R生態(tài)條件下的。如:
在我們理解單細(xì)胞數(shù)據(jù)的時(shí)候一張cell X gene 的大表不能離開我們的腦海。
adata.to_df()
Out[24]:
RP11-34P13.3 FAM138A ... AC213203.1 FAM231B
AAACCCAAGCGTATGG-1 0.0 0.0 ... 0.0 0.0
AAACCCAGTCCTACAA-1 0.0 0.0 ... 0.0 0.0
AAACCCATCACCTCAC-1 0.0 0.0 ... 0.0 0.0
AAACGCTAGGGCATGT-1 0.0 0.0 ... 0.0 0.0
AAACGCTGTAGGTACG-1 0.0 0.0 ... 0.0 0.0
... ... ... ... ...
TTTGTTGCAGGTACGA-1 0.0 0.0 ... 0.0 0.0
TTTGTTGCAGTCTCTC-1 0.0 0.0 ... 0.0 0.0
TTTGTTGGTAATTAGG-1 0.0 0.0 ... 0.0 0.0
TTTGTTGTCCTTGGAA-1 0.0 0.0 ... 0.0 0.0
TTTGTTGTCGCACGAC-1 0.0 0.0 ... 0.0 0.0
當(dāng)我們有多個(gè)樣本的時(shí)候就是有多張這樣的表,那讓我們自己手動(dòng)來整合這兩張表的話占拍,我們會(huì)怎么做呢权逗?
肯定是行列分別對齊把它們拼在一起啊苇羡,就像拼積木一樣的,但是這樣的結(jié)果就是:
兩個(gè)樣本在圖譜上完全的分開來了。我們不同平臺(tái)的樣本,相同的細(xì)胞類型應(yīng)該是在一起的啊绞铃。于是我們開始思考如何完成這樣的整合。
seurat提供了一套解決方案嫂侍,就是在數(shù)據(jù)集中構(gòu)建錨點(diǎn)儿捧,將不同數(shù)據(jù)集中相似的細(xì)胞錨在一起。
那么如何錨挑宠,選擇哪些特征來錨定菲盾,又開發(fā)出不同的算法。不管算法如何各淀,首先我們看看這種錨定可以為我們帶來什么懒鉴?相同的細(xì)胞類型mapping在一起,一個(gè)自然的作用就是用來mapping細(xì)胞類型未知的數(shù)據(jù)揪阿。
所以在scanpy中也如seurat一樣在多樣本分析中疗我,分別給出reference的方法和整合的方法。目前在scanpy中分別是ingest和BBKNN(Batch balanced kNN)南捂,當(dāng)然整合也是可以用來做reference的。scanpy.external.pp.mnn_correct應(yīng)該也是可以用的旧找。
先來看ingest溺健,通過投射到參考數(shù)據(jù)上的PCA(或備用模型)上,將一個(gè)adata的嵌入和注釋與一個(gè)參考數(shù)據(jù)集adata_ref集成在一起。該函數(shù)使用knn分類器來映射標(biāo)簽鞭缭,使用UMAP來映射嵌入剖膳。
再來看看bbknn是一個(gè)快速和直觀的批處理效果去除工具,可以直接在scanpy工作流中使用岭辣。它是scanpy.api.pp.neighbors()的替代方法吱晒,這兩個(gè)函數(shù)都創(chuàng)建了一個(gè)鄰居圖,以便后續(xù)在集群沦童、偽時(shí)間和UMAP可視化中使用仑濒。標(biāo)準(zhǔn)方法首先確定整個(gè)數(shù)據(jù)結(jié)構(gòu)中每個(gè)單元的k個(gè)最近鄰,然后將候選單元轉(zhuǎn)換為指數(shù)相關(guān)的連接偷遗,然后作為進(jìn)一步分析的基礎(chǔ)墩瞳。
那么我們就來看一下在scanpy的實(shí)現(xiàn)吧。
import scanpy as sc
import pandas as pd
import seaborn as sns
import sklearn
import sys
import scipy
import bbknn
sc.settings.verbosity = 1 # verbosity: errors (0), warnings (1), info (2), hints (3)
sc.logging.print_versions()
sc.settings.set_figure_params(dpi=80, frameon=False, figsize=(3, 3))
scanpy==1.4.5.1 anndata==0.7.1 umap==0.3.10 numpy==1.16.5 scipy==1.3.1 pandas==0.25.1 scikit-learn==0.21.3 statsmodels==0.10.1 python-igraph==0.8.0
ingest 注釋
adata_ref = sc.datasets.pbmc3k_processed() # this is an earlier version of the dataset from the pbmc3k tutorial
adata_ref
AnnData object with n_obs × n_vars = 2638 × 1838
obs: 'n_genes', 'percent_mito', 'n_counts', 'louvain'
var: 'n_cells'
uns: 'draw_graph', 'louvain', 'louvain_colors', 'neighbors', 'pca', 'rank_genes_groups'
obsm: 'X_pca', 'X_tsne', 'X_umap', 'X_draw_graph_fr'
varm: 'PCs'
我們一次看看以下參考數(shù)據(jù)集都有哪些內(nèi)容:
adata_ref.obs
Out[9]:
n_genes percent_mito n_counts louvain
index
AAACATACAACCAC-1 781 0.030178 2419.0 CD4 T cells
AAACATTGAGCTAC-1 1352 0.037936 4903.0 B cells
AAACATTGATCAGC-1 1131 0.008897 3147.0 CD4 T cells
AAACCGTGCTTCCG-1 960 0.017431 2639.0 CD14+ Monocytes
AAACCGTGTATGCG-1 522 0.012245 980.0 NK cells
... ... ... ...
TTTCGAACTCTCAT-1 1155 0.021104 3459.0 CD14+ Monocytes
TTTCTACTGAGGCA-1 1227 0.009294 3443.0 B cells
TTTCTACTTCCTCG-1 622 0.021971 1684.0 B cells
TTTGCATGAGAGGC-1 454 0.020548 1022.0 B cells
TTTGCATGCCTCAC-1 724 0.008065 1984.0 CD4 T cells
[2638 rows x 4 columns]
adata_ref.var
Out[10]:
n_cells
index
TNFRSF4 155
CPSF3L 202
ATAD3C 9
C1orf86 501
RER1 608
...
ICOSLG 34
SUMO3 570
SLC19A1 31
S100B 94
PRMT2 588
[1838 rows x 1 columns]
adata_ref.uns['louvain_colors']
Out[14]:
array(['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728', '#9467bd', '#8c564b',
'#e377c2', '#bcbd22'], dtype='<U7')
adata_ref.obsm
Out[16]: AxisArrays with keys: X_pca, X_tsne, X_umap, X_draw_graph_fr
adata_ref.obsm['X_umap']
Out[17]:
array([[ 1.35285574, 2.26612719],
[-0.47802448, 7.87730423],
[ 2.16588875, -0.24481226],
...,
[ 0.34670979, 8.34967798],
[ 0.19864146, 9.56698797],
[ 2.62803322, 0.36722543]])
有沒有再次理解AnnData 這個(gè)對象的數(shù)據(jù)結(jié)構(gòu)呢氏豌?
可以看到在這個(gè)數(shù)據(jù)集中降維聚類都是做過的喉酌,我們可以畫個(gè)圖看看:
sc.pl.umap(adata_ref, color='louvain')
接下來我們看看要預(yù)測的數(shù)據(jù)集是怎樣的。
adata = sc.datasets.pbmc68k_reduced()
adata
AnnData object with n_obs × n_vars = 700 × 765
obs: 'bulk_labels', 'n_genes', 'percent_mito', 'n_counts', 'S_score', 'G2M_score', 'phase', 'louvain'
var: 'n_counts', 'means', 'dispersions', 'dispersions_norm', 'highly_variable'
uns: 'bulk_labels_colors', 'louvain', 'louvain_colors', 'neighbors', 'pca', 'rank_genes_groups'
obsm: 'X_pca', 'X_umap'
varm: 'PCs'
可見它也是降維聚類過的了泵喘。
sc.pl.umap(adata, color='louvain')
這個(gè)數(shù)據(jù)集并沒有得到細(xì)胞類型的定義泪电。
構(gòu)建注釋數(shù)據(jù)結(jié)構(gòu):
var_names = adata_ref.var_names.intersection(adata.var_names) # 取交集
adata_ref = adata_ref[:, var_names]
adata = adata[:, var_names]
sc.pp.pca(adata_ref)
sc.pp.neighbors(adata_ref)
sc.tl.umap(adata_ref)
sc.tl.leiden(adata_ref)# 新的聚類方法
sc.pl.umap(adata_ref, color=['louvain','leiden'])
adata_ref
AnnData object with n_obs × n_vars = 2638 × 208
obs: 'n_genes', 'percent_mito', 'n_counts', 'louvain', 'leiden'
var: 'n_cells'
uns: 'draw_graph', 'louvain', 'louvain_colors', 'neighbors', 'pca', 'rank_genes_groups', 'umap', 'leiden', 'leiden_colors'
obsm: 'X_pca', 'X_tsne', 'X_umap', 'X_draw_graph_fr'
varm: 'PCs'
sc.pp.pca(adata)
sc.pp.neighbors(adata)
sc.tl.umap(adata)
sc.tl.leiden(adata)
sc.pl.umap(adata, color=['louvain','leiden'])
adata
AnnData object with n_obs × n_vars = 700 × 208
obs: 'bulk_labels', 'n_genes', 'percent_mito', 'n_counts', 'S_score', 'G2M_score', 'phase', 'louvain', 'leiden'
var: 'n_counts', 'means', 'dispersions', 'dispersions_norm', 'highly_variable'
uns: 'bulk_labels_colors', 'louvain', 'louvain_colors', 'neighbors', 'pca', 'rank_genes_groups', 'umap', 'leiden', 'leiden_colors'
obsm: 'X_pca', 'X_umap'
varm: 'PCs'
用ingest來做細(xì)胞注釋吧。
sc.tl.ingest(adata, adata_ref, obs='louvain')
adata.uns['louvain_colors'] = adata_ref.uns['louvain_colors'] # fix colors
我們來看看sc.tl.ingest的幫助文檔:
Help on function ingest in module scanpy.tools._ingest:
ingest(adata: anndata._core.anndata.AnnData, adata_ref: anndata._core.anndata.AnnData, obs: Union[str, Iterable[str], NoneType] = None, embedding_method: Union[str, Iterable[str]] = ('umap', 'pca'), labeling_method: str = 'knn', inplace: bool = True, **kwargs)
Map labels and embeddings from reference data to new data.
:tutorial:`integrating-data-using-ingest`
Integrates embeddings and annotations of an `adata` with a reference dataset
`adata_ref` through projecting on a PCA (or alternate
model) that has been fitted on the reference data. The function uses a knn
classifier for mapping labels and the UMAP package [McInnes18]_ for mapping
the embeddings.
.. note::
We refer to this *asymmetric* dataset integration as *ingesting*
annotations from reference data to new data. This is different from
learning a joint representation that integrates both datasets in an
unbiased way, as CCA (e.g. in Seurat) or a conditional VAE (e.g. in
scVI) would do.
You need to run :func:`~scanpy.pp.neighbors` on `adata_ref` before
passing it.
Parameters
----------
adata
The annotated data matrix of shape `n_obs` × `n_vars`. Rows correspond
to cells and columns to genes. This is the dataset without labels and
embeddings.
adata_ref
The annotated data matrix of shape `n_obs` × `n_vars`. Rows correspond
to cells and columns to genes.
Variables (`n_vars` and `var_names`) of `adata_ref` should be the same
as in `adata`.
This is the dataset with labels and embeddings
which need to be mapped to `adata`.
obs
Labels' keys in `adata_ref.obs` which need to be mapped to `adata.obs`
(inferred for observation of `adata`).
embedding_method
Embeddings in `adata_ref` which need to be mapped to `adata`.
The only supported values are 'umap' and 'pca'.
labeling_method
The method to map labels in `adata_ref.obs` to `adata.obs`.
The only supported value is 'knn'.
inplace
Only works if `return_joint=False`.
Add labels and embeddings to the passed `adata` (if `True`)
or return a copy of `adata` with mapped embeddings and labels.
Returns
-------
* if `inplace=False` returns a copy of `adata`
with mapped embeddings and labels in `obsm` and `obs` correspondingly
* if `inplace=True` returns `None` and updates `adata.obsm` and `adata.obs`
with mapped embeddings and labels
Example
-------
Call sequence:
import scanpy as sc
sc.pp.neighbors(adata_ref)
sc.tl.umap(adata_ref)
sc.tl.ingest(adata, adata_ref, obs='cell_type')
.. _ingest PBMC tutorial: https://scanpy-tutorials.readthedocs.io/en/latest/integrating-pbmcs-using-ingest.html
.. _ingest Pancreas tutorial: https://scanpy-tutorials.readthedocs.io/en/latest/integrating-pancreas-using-ingest.html
通過比較‘bulk_label’注釋和‘louvain’注釋纪铺,我們發(fā)現(xiàn)數(shù)據(jù)被合理地映射相速,只有樹突細(xì)胞的注釋似乎是含糊不清的,在adata中可能已經(jīng)是模糊的了霹陡。我們來對adata做進(jìn)一步的處理和蚪。
adata_concat = adata_ref.concatenate(adata, batch_categories=['ref', 'new'])
adata_concat.obs.louvain = adata_concat.obs.louvain.astype('category')
adata_concat.obs.louvain.cat.reorder_categories(adata_ref.obs.louvain.cat.categories, inplace=True) # fix category ordering
adata_concat.uns['louvain_colors'] = adata_ref.uns['louvain_colors'] # fix category colors
adata_concat
sc.pl.umap(adata_concat, color=['batch', 'louvain'])
AnnData object with n_obs × n_vars = 3338 × 208
obs: 'G2M_score', 'S_score', 'batch', 'bulk_labels', 'leiden', 'louvain', 'n_counts', 'n_genes', 'percent_mito', 'phase'
var: 'n_cells-ref', 'n_counts-new', 'means-new', 'dispersions-new', 'dispersions_norm-new', 'highly_variable-new'
obsm: 'X_pca', 'X_umap'
雖然在單核細(xì)胞和樹突狀細(xì)胞簇中似乎存在一些批處理效應(yīng),但在其他方面烹棉,新數(shù)據(jù)被繪制得相對均勻攒霹。
巨核細(xì)胞只存在于adata_ref中,沒有來自adata映射的單元格浆洗。如果交換參考數(shù)據(jù)和查詢數(shù)據(jù)催束,巨核細(xì)胞不再作為單獨(dú)的集群出現(xiàn)。這是一個(gè)極端的情況伏社,因?yàn)閰⒖紨?shù)據(jù)非常小;但是抠刺,人們應(yīng)該始終質(zhì)疑參考數(shù)據(jù)是否包含足夠的生物變異,以便有意義地容納查詢數(shù)據(jù)摘昌。
使用BBKNN整合
sc.tl.pca(adata_concat)
sc.external.pp.bbknn(adata_concat, batch_key='batch') # running bbknn 1.3.6
sc.tl.umap(adata_concat)
sc.pl.umap(adata_concat, color=['batch', 'louvain'])
adata_concat
Out[45]:
AnnData object with n_obs × n_vars = 3338 × 208
obs: 'G2M_score', 'S_score', 'batch', 'bulk_labels', 'leiden', 'louvain', 'n_counts', 'n_genes', 'percent_mito', 'phase'
var: 'n_cells-ref', 'n_counts-new', 'means-new', 'dispersions-new', 'dispersions_norm-new', 'highly_variable-new'
uns: 'batch_colors', 'louvain_colors', 'pca', 'neighbors', 'umap'
obsm: 'X_pca', 'X_umap'
varm: 'PCs'
BBKNN并不維持巨核細(xì)胞簇速妖。然而,它似乎更均勻地混合細(xì)胞聪黎。
一個(gè)例子使用BBKNN整合數(shù)據(jù)的例子
以下數(shù)據(jù)已在scGen論文[Lotfollahi19]中使用罕容。點(diǎn)擊pancreas下載數(shù)據(jù)。
它包含了來自4個(gè)不同研究(Segerstolpe16, Baron16, Wang16, Muraro16)的人類胰腺數(shù)據(jù),這些數(shù)據(jù)在單細(xì)胞數(shù)據(jù)集集成的開創(chuàng)性論文(Butler18, Haghverdi18)中被使用過锦秒,并在此后多次被使用露泊。
h5ad = 'E:\\learnscanpy\\data\\objects-pancreas\\pancreas.h5ad'
adata_all = sc.read_h5ad(h5ad)
adata_all
AnnData object with n_obs × n_vars = 14693 × 2448
obs: 'celltype', 'sample', 'n_genes', 'batch', 'n_counts', 'louvain'
var: 'n_cells-0', 'n_cells-1', 'n_cells-2', 'n_cells-3'
uns: 'celltype_colors', 'louvain', 'neighbors', 'pca', 'sample_colors'
obsm: 'X_pca', 'X_umap'
varm: 'PCs'
counts = adata_all.obs.celltype.value_counts()
counts
Out[173]:
alpha 4214
beta 3354
ductal 1804
acinar 1368
not applicable 1154
delta 917
gamma 571
endothelial 289
activated_stellate 284
dropped 178
quiescent_stellate 173
mesenchymal 80
macrophage 55
PSC 54
unclassified endocrine 41
co-expression 39
mast 32
epsilon 28
mesenchyme 27
schwann 13
t_cell 7
MHC class II 5
unclear 4
unclassified 2
Name: celltype, dtype: int64
adata_all.obs
Out[171]:
celltype sample ... n_counts louvain
index ...
human1_lib1.final_cell_0001-0 acinar Baron ... 2.241100e+04 2
human1_lib1.final_cell_0002-0 acinar Baron ... 2.794900e+04 2
human1_lib1.final_cell_0003-0 acinar Baron ... 1.689200e+04 2
human1_lib1.final_cell_0004-0 acinar Baron ... 1.929900e+04 2
human1_lib1.final_cell_0005-0 acinar Baron ... 1.506700e+04 2
... ... ... ... ...
reads.29499-3 ductal Wang ... 1.056558e+06 10
reads.29500-3 ductal Wang ... 9.926309e+05 10
reads.29501-3 beta Wang ... 1.751338e+06 10
reads.29502-3 dropped Wang ... 2.163764e+06 10
reads.29503-3 beta Wang ... 2.038979e+06 10
[14693 rows x 6 columns]
可以看出這個(gè)數(shù)據(jù)集已經(jīng)降維聚類好了,所以我們可以可視化一下:
sc.pl.umap(adata_all,color=['sample', 'celltype','louvain'])
樣本之間的批次很嚴(yán)重啊旅择。
去掉細(xì)胞數(shù)較小的小群惭笑,
minority_classes = counts.index[-5:].tolist() # get the minority classes
# ['schwann', 't_cell', 'MHC class II', 'unclear', 'unclassified']
adata_all = adata_all[ # actually subset
~adata_all.obs.celltype.isin(minority_classes)]
adata_all.obs.celltype.cat.reorder_categories( # reorder according to abundance
counts.index[:-5].tolist(), inplace=True)
adata_all.obs.celltype.value_counts()
Out[182]:
alpha 4214
beta 3354
ductal 1804
acinar 1368
not applicable 1154
delta 917
gamma 571
endothelial 289
activated_stellate 284
dropped 178
quiescent_stellate 173
mesenchymal 80
macrophage 55
PSC 54
unclassified endocrine 41
co-expression 39
mast 32
epsilon 28
mesenchyme 27
進(jìn)行pca降維和umap降維:
sc.pp.pca(adata_all)
sc.pp.neighbors(adata_all)
sc.tl.umap(adata_all)
sc.pl.umap(adata_all, color=['batch', 'celltype'], palette=sc.pl.palettes.vega_20_scanpy)
下面我們使用BBKNN來整合數(shù)據(jù):
sc.external.pp.bbknn(adata_all, batch_key='batch')
sc.tl.umap(adata_all)
adata_all
sc.pl.umap(adata_all, color=['sample','batch', 'celltype'])
果然要比原始的數(shù)據(jù)好多了。但是改變的是什么生真?
AnnData object with n_obs × n_vars = 14662 × 2448
obs: 'celltype', 'sample', 'n_genes', 'batch', 'n_counts', 'louvain'
var: 'n_cells-0', 'n_cells-1', 'n_cells-2', 'n_cells-3'
uns: 'celltype_colors', 'louvain', 'neighbors', 'pca', 'sample_colors', 'louvain_colors', 'umap', 'batch_colors'
obsm: 'X_pca', 'X_umap'
varm: 'PCs'
如果想對其中某個(gè)樣本進(jìn)行單獨(dú)的注釋沉噩,可以用上面提到的ingest。選擇一個(gè)參考批次來訓(xùn)練模型和建立鄰域圖(這里是一個(gè)PCA)汇歹,并分離出所有其他批次屁擅。
adata_ref = adata_all[adata_all.obs.batch == '0']
adata_ref
Out[191]:
View of AnnData object with n_obs × n_vars = 8549 × 2448
obs: 'celltype', 'sample', 'n_genes', 'batch', 'n_counts', 'louvain'
var: 'n_cells-0', 'n_cells-1', 'n_cells-2', 'n_cells-3'
uns: 'celltype_colors', 'louvain', 'neighbors', 'pca', 'sample_colors', 'louvain_colors', 'umap', 'batch_colors'
obsm: 'X_pca', 'X_umap'
varm: 'PCs'
sc.pp.pca(adata_ref)
sc.pp.neighbors(adata_ref)
sc.tl.umap(adata_ref)
adata_ref
Out[197]:
AnnData object with n_obs × n_vars = 8549 × 2448
obs: 'celltype', 'sample', 'n_genes', 'batch', 'n_counts', 'louvain'
var: 'n_cells-0', 'n_cells-1', 'n_cells-2', 'n_cells-3'
uns: 'celltype_colors', 'louvain', 'neighbors', 'pca', 'sample_colors', 'louvain_colors', 'umap', 'batch_colors'
obsm: 'X_pca', 'X_umap'
varm: 'PCs'
sc.pl.umap(adata_ref, color='celltype')
選取數(shù)據(jù)集用ingest于adata_ref進(jìn)行mapping:
adatas = [adata_all[adata_all.obs.batch == i].copy() for i in ['1', '2', '3']]
sc.settings.verbosity = 2 # a bit more logging
for iadata, adata in enumerate(adatas):
print(f'... integrating batch {iadata+1}')
adata.obs['celltype_orig'] = adata.obs.celltype # save the original cell type
sc.tl.ingest(adata, adata_ref, obs='celltype')
integrating batch 1
running ingest
finished (0:00:08)
integrating batch 2
running ingest
finished (0:00:06)
integrating batch 3
running ingest
finished (0:00:03)
adata_concat = adata_ref.concatenate(adatas)
adata_concat
Out[200]:
AnnData object with n_obs × n_vars = 14662 × 2448
obs: 'batch', 'celltype', 'celltype_orig', 'louvain', 'n_counts', 'n_genes', 'sample'
var: 'n_cells-0', 'n_cells-1', 'n_cells-2', 'n_cells-3'
obsm: 'X_pca', 'X_umap'
adata_concat.obs.celltype = adata_concat.obs.celltype.astype('category')
adata_concat.obs.celltype.cat.reorder_categories(adata_ref.obs.celltype.cat.categories, inplace=True) # fix category ordering
adata_concat.uns['celltype_colors'] = adata_ref.uns['celltype_colors'] # fix category coloring
sc.pl.umap(adata_concat, color=['celltype_orig','batch', 'celltype'])
與BBKNN的結(jié)果相比,這是以一種更加明顯的方式保持分群产弹。如果已經(jīng)觀察到一個(gè)想要的連續(xù)結(jié)構(gòu)(例如在造血數(shù)據(jù)集中)派歌,攝取允許容易地維持這個(gè)結(jié)構(gòu)。
一致性評(píng)估
adata_query = adata_concat[adata_concat.obs.batch.isin(['1', '2', '3'])]
View of AnnData object with n_obs × n_vars = 6113 × 2448
obs: 'batch', 'celltype', 'celltype_orig', 'louvain', 'n_counts', 'n_genes', 'sample'
var: 'n_cells-0', 'n_cells-1', 'n_cells-2', 'n_cells-3'
uns: 'celltype_colors', 'celltype_orig_colors', 'batch_colors'
obsm: 'X_pca', 'X_umap'
sc.pl.umap(
adata_query, color=['batch', 'celltype', 'celltype_orig'], wspace=0.4)
這個(gè)結(jié)果依然不能很好的反映一致性痰哨,讓我們首先關(guān)注與參考保守的細(xì)胞類型胶果,以簡化混淆矩陣的reads。
obs_query = adata_query.obs
conserved_categories = obs_query.celltype.cat.categories.intersection(obs_query.celltype_orig.cat.categories) # intersected categories
obs_query_conserved = obs_query.loc[obs_query.celltype.isin(conserved_categories) & obs_query.celltype_orig.isin(conserved_categories)] # intersect categories
obs_query_conserved.celltype.cat.remove_unused_categories(inplace=True) # remove unused categoriyes
obs_query_conserved.celltype_orig.cat.remove_unused_categories(inplace=True) # remove unused categoriyes
obs_query_conserved.celltype_orig.cat.reorder_categories(obs_query_conserved.celltype.cat.categories, inplace=True) # fix category ordering
obs_query_conserved
Out[214]:
batch celltype celltype_orig ... n_counts n_genes sample
D28.1_1-1-1 1 alpha alpha ... 2.322583e+04 5448 Muraro
D28.1_13-1-1 1 ductal ductal ... 2.334263e+04 5911 Muraro
D28.1_15-1-1 1 alpha alpha ... 2.713471e+04 5918 Muraro
D28.1_17-1-1 1 alpha alpha ... 1.581207e+04 4522 Muraro
D28.1_2-1-1 1 endothelial endothelial ... 3.173151e+04 6464 Muraro
... ... ... ... ... ... ...
reads.29498-3-3 3 ductal ductal ... 1.362606e+06 19950 Wang
reads.29499-3-3 3 ductal ductal ... 1.056558e+06 19950 Wang
reads.29500-3-3 3 ductal ductal ... 9.926309e+05 19950 Wang
reads.29501-3-3 3 beta beta ... 1.751338e+06 19950 Wang
reads.29503-3-3 3 beta beta ... 2.038979e+06 19950 Wang
pd.crosstab(obs_query_conserved.celltype, obs_query_conserved.celltype_orig)
Out[215]:
celltype_orig alpha beta ductal acinar delta gamma endothelial mast
celltype
alpha 1819 3 7 0 1 20 0 5
beta 49 803 3 1 10 26 0 0
ductal 8 5 693 263 0 0 0 0
acinar 1 3 2 145 0 3 0 0
delta 5 4 0 0 305 73 0 0
gamma 1 5 0 0 0 194 0 0
endothelial 2 0 0 0 0 0 36 0
mast 0 0 1 0 0 0 0 2
總的來說斤斧,保守的細(xì)胞類型也如預(yù)期的那樣被映射早抠。主要的例外是原始注釋中出現(xiàn)的一些腺泡細(xì)胞。然而撬讽,已經(jīng)觀察到參考數(shù)據(jù)同時(shí)具有腺泡和導(dǎo)管細(xì)胞蕊连,這就解釋了差異,并指出了初始注釋中潛在的不一致性游昼。
現(xiàn)在讓我們繼續(xù)看看所有的細(xì)胞類型甘苍。
pd.crosstab(adata_query.obs.celltype, adata_query.obs.celltype_orig)
Out[216]:
celltype_orig PSC acinar ... not applicable unclassified endocrine
celltype ...
alpha 0 0 ... 304 11
beta 0 1 ... 522 24
ductal 0 263 ... 106 1
acinar 0 145 ... 86 0
delta 0 0 ... 95 5
gamma 0 0 ... 14 0
endothelial 1 0 ... 7 0
activated_stellate 49 1 ... 17 0
quiescent_stellate 4 0 ... 1 0
macrophage 0 0 ... 1 0
mast 0 0 ... 1 0
[11 rows x 16 columns]
我們觀察到PSC(胰腺星狀細(xì)胞)細(xì)胞實(shí)際上只是不一致地注釋并正確地映射到“激活的星狀細(xì)胞”上。
此外烘豌,很高興看到“間充質(zhì)”和“間充質(zhì)”細(xì)胞都屬于同一類別载庭。但是,這個(gè)類別又是“activated_stellate”廊佩,而且可能是錯(cuò)誤的囚聚。這就是我們說的,算法只能接近真相标锄,而不能定義真相顽铸。
可視化分布的批次
通常,批量對應(yīng)的是想要比較的實(shí)驗(yàn)料皇。Scanpy提供了方便的可視化可能性跋破,主要有
- a density plot
- a partial visualization of a subset of categories/groups in an emnbedding
sc.tl.embedding_density(adata_concat, groupby='batch')
sc.pl.embedding_density(adata_concat, groupby='batch')
for batch in ['1', '2', '3']:
sc.pl.umap(adata_concat, color='batch', groups=[batch])
BBKNN: fast batch alignment of single cell transcriptomes
integrating-data-using-ingest