Cell2location 是一個原則性的貝葉斯模型,它可以解析空間轉(zhuǎn)錄數(shù)據(jù)中的細(xì)胞類型籽腕,并創(chuàng)建不同組織的全面細(xì)胞圖乾蓬。Cell2location 解釋了變異的技術(shù)來源河劝,并借用了不同位置的統(tǒng)計強度壁榕,從而使單細(xì)胞和空間轉(zhuǎn)錄組的集成比現(xiàn)有工具具有更高的靈敏度和分辨率。這是通過估計哪種細(xì)胞類型的組合赎瞎,其細(xì)胞豐度可以給出空間數(shù)據(jù)中的mRNA counts數(shù)牌里,同時模擬技術(shù)效應(yīng)(平臺/技術(shù)效應(yīng),污染 RNA务甥,無法解釋的方差)來實現(xiàn)的二庵。
Cell2location需要未轉(zhuǎn)化、未標(biāo)準(zhǔn)化的空間mRNA counts數(shù)作為輸入缓呛;還需要提供給Cell2location每個位置被期望的細(xì)胞豐度催享,這是用來作為一個估計完整細(xì)胞豐度的先導(dǎo),這個數(shù)值取決于組織哟绊,可以通過計算配對組織學(xué)圖像中少數(shù)部位的細(xì)胞核來估計因妙,可以是近似的。
Cell2location提供了從 scRNA-seq 數(shù)據(jù)中估計參考細(xì)胞類型特征的兩種方法:
- 一種基于負(fù)二項回歸的統(tǒng)計方法票髓。開發(fā)者通常推薦使用 NB 回歸攀涵,它允許將跨技術(shù)和批次處理的數(shù)據(jù)有力地結(jié)合起來,從而提高了空間映射的準(zhǔn)確性洽沟。
- 硬編碼計算個體基因的每簇平均 mRNA counts數(shù)(
cell2location.cluster_averages.compute_cluster_averages
)以故。當(dāng)批次效應(yīng)很小時,這種計算每個集群平均值的更快的硬編碼方法提供了同樣高的準(zhǔn)確性裆操。我們還建議對于非 UMI 技術(shù)使用硬編碼方法怒详,如 Smart-Seq 2。
加載需要的包
import sys
IN_COLAB = "google.colab" in sys.modules
if IN_COLAB and branch == "stable":
!pip install --quiet git+https://github.com/BayraktarLab/cell2location#egg=cell2location[tutorials]
import scanpy as sc
import anndata
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
import cell2location
import scvi
from matplotlib import rcParams
rcParams['pdf.fonttype'] = 42 # enables correct plotting of text
import seaborn as sns
#定義分析結(jié)果保存在哪
results_folder = './results/lymph_nodes_analysis/'
# create paths and names to results folders for reference regression and cell2location models
ref_run_name = f'{results_folder}/reference_signatures'
run_name = f'{results_folder}/cell2location_map'
加載 Visium 和 scRNA-seq 參考數(shù)據(jù)
#通過scanpy導(dǎo)入公開發(fā)表的淋巴結(jié)空間轉(zhuǎn)錄組數(shù)據(jù)
adata_vis = sc.datasets.visium_sge(sample_id="V1_Human_Lymph_Node")
adata_vis.obs['sample'] = list(adata_vis.uns['spatial'].keys())[0]
# rename genes to ENSEMBL
adata_vis.var['SYMBOL'] = adata_vis.var_names#_names:提取行名
adata_vis.var_names = adata_vis.var['gene_ids']
adata_vis.var_names.name = None#_names.name = None將名稱設(shè)為無
注:線粒體編碼基因(基因名稱以 MT 或 MT-開頭)對空間定位無關(guān)踪区,因為它們的表達代表了單個細(xì)胞和細(xì)胞核數(shù)據(jù)中的技術(shù)性假象昆烁,而不是線粒體的生物豐度。然而缎岗,這些基因在每個位置的mRNA中占15-40% 静尼。因此,為了避免繪制偽影传泊,強烈建議移除線粒體基因鼠渺。
# find mitochondria-encoded (MT) genes
adata_vis.var['MT_gene'] = [gene.startswith('MT-') for gene in adata_vis.var['SYMBOL']]
# remove MT genes for spatial mapping (keeping their counts in the object)
adata_vis.obsm['MT'] = adata_vis[:, adata_vis.var['MT_gene'].values].X.toarray()#傳遞了稀疏矩陣,但需要密集數(shù)據(jù)眷细。使用X.toarray()轉(zhuǎn)換為密集的numpy數(shù)組
adata_vis = adata_vis[:, ~adata_vis.var['MT_gene'].values]
已發(fā)表的淋巴結(jié) scRNA-seq 數(shù)據(jù)集通常缺乏足夠的代表性的生發(fā)中心相關(guān)的免疫細(xì)胞群體拦盹,由于患者捐贈者的年齡。因此薪鹦,我們將涵蓋淋巴結(jié)掌敬、脾臟和扁桃體的 scRNA-seq 數(shù)據(jù)集包含在單細(xì)胞參考數(shù)據(jù)中,以確保在空間轉(zhuǎn)錄數(shù)據(jù)集中能夠捕捉到可能存在的免疫細(xì)胞狀態(tài)的全部多樣性池磁。
在這里奔害,我們下載這個數(shù)據(jù)集,導(dǎo)入到anndata并改變變量名稱為 ENSEMBL 基因標(biāo)識符地熄。
# Download data if not already here
import os
if not os.path.exists('./data/sc.h5ad'):
!cd ./data/ && wget https://cell2location.cog.sanger.ac.uk/paper/integrated_lymphoid_organ_scrna/RegressionNBV4Torch_57covariates_73260cells_10237genes/sc.h5ad
# Read data
adata_ref = sc.read(f'./data/sc.h5ad')
# Use ENSEMBL as gene IDs to make sure IDs are unique and correctly matched
adata_ref.var['SYMBOL'] = adata_ref.var.index
adata_ref.var.index = adata_ref.var['GeneID-2'].copy()
adata_ref.var_names = adata_ref.var['GeneID-2'].copy()
adata_ref.var.index.name = None
adata_ref.raw.var['SYMBOL'] = adata_ref.raw.var.index
adata_ref.raw.var.index = adata_ref.raw.var['GeneID-2'].copy()
adata_ref.raw.var.index.name = None
# before we estimate the reference cell type signature we recommend to perform very permissive genes selection
# in this 2D histogram orange rectangle lays over excluded genes.
# In this case, the downloaded dataset was already filtered using this method,
# hence no density under the orange rectangle
from cell2location.utils.filtering import filter_genes
selected = filter_genes(adata_ref, cell_count_cutoff=5, cell_percentage_cutoff2=0.03, nonz_mean_cutoff=1.12)
# filter the object
adata_ref = adata_ref[:, selected].copy()
參考細(xì)胞類型特征的估計(NB 回歸)
# prepare anndata for the regression model
scvi.data.setup_anndata(adata=adata_ref,
# 10X reaction / sample / batch
batch_key='Sample',
# cell type, covariate used for constructing signatures
labels_key='Subset',
# multiplicative technical effects (platform, 3' vs 5', donor effect)
categorical_covariate_keys=['Method']
)
scvi.data.view_anndata_setup(adata_ref)
# create and train the regression model
from cell2location.models import RegressionModel
mod = RegressionModel(adata_ref)
# Use all data for training (validation not implemented yet, train_size=1)
mod.train(max_epochs=250, batch_size=2500, train_size=1, lr=0.002, use_gpu=True)
# plot ELBO loss history during training, removing first 20 epochs from the plot
mod.plot_history(20)
# In this section, we export the estimated cell abundance (summary of the posterior distribution).
adata_ref = mod.export_posterior(
adata_ref, sample_kwargs={'num_samples': 1000, 'batch_size': 2500, 'use_gpu': True}
)
# Save model
mod.save(f"{ref_run_name}", overwrite=True)
# Save anndata object with results
adata_file = f"{ref_run_name}/sc.h5ad"
adata_ref.write(adata_file)
adata_file
檢查QC后的圖
- 重建準(zhǔn)確性去評估是否有任何組織存在問題
- 由于批次效應(yīng)华临,估計的表達特征不同于每個集群中的平均表達。對于不受批次效應(yīng)影響的 scRNA-seq 數(shù)據(jù)集(這個數(shù)據(jù)集) 端考,可以使用聚類平均表達式來代替使用模型估計特征雅潭。當(dāng)這個圖與對角線圖非常不同時(例如 y 軸上的值非常低,密度無處不在) 却特,它表明特征估計存在問題扶供。
mod.plot_QC()
#模型和輸出 h5ad 可以像下面這樣加載:
mod = cell2location.models.RegressionModel.load(f"{ref_run_name}", adata_ref)
adata_file = f"{ref_run_name}/sc.h5ad"
adata_ref = sc.read_h5ad(adata_file)
# export estimated expression in each cluster
if 'means_per_cluster_mu_fg' in adata_ref.varm.keys():
inf_aver = adata_ref.varm['means_per_cluster_mu_fg'][[f'means_per_cluster_mu_fg_{i}'
for i in adata_ref.uns['mod']['factor_names']]].copy()
else:
inf_aver = adata_ref.var[[f'means_per_cluster_mu_fg_{i}'
for i in adata_ref.uns['mod']['factor_names']]].copy()
inf_aver.columns = adata_ref.uns['mod']['factor_names']
inf_aver.iloc[0:5, 0:5]
Cell2location: 空間比對
# find shared genes and subset both anndata and reference signatures
intersect = np.intersect1d(adata_vis.var_names, inf_aver.index)
adata_vis = adata_vis[:, intersect].copy()
inf_aver = inf_aver.loc[intersect, :].copy()
# prepare anndata for cell2location model
scvi.data.setup_anndata(adata=adata_vis, batch_key="sample")
scvi.data.view_anndata_setup(adata_vis)
在這里,需要指定2個用戶提供的超參數(shù)(
n_cells_per_location
和 detection_alpha
)注:雖然可以經(jīng)常使用
detection_alpha
超參數(shù)的默認(rèn)值裂明,但是將預(yù)期的細(xì)胞豐度 n_cells per_location
調(diào)整到每個組織是有用的椿浓。這個值可以通過成對的組織學(xué)圖像估計,如上面的注釋所述闽晦。將本教程中提供的值(
n_cells_ per_location = 30
)更改為在您的組織中觀察到的值扳碍。
# create and train the model
mod = cell2location.models.Cell2location(
adata_vis, cell_state_df=inf_aver,
# the expected average cell abundance: tissue-dependent
# hyper-prior which can be estimated from paired histology:
N_cells_per_location=30,
# hyperparameter controlling normalisation of
# within-experiment variation in RNA detection (using default here):
detection_alpha=200
)
mod.train(max_epochs=30000,
# train using full data (batch_size=None)
batch_size=None,
# use all data points in training because
# we need to estimate cell abundance at all locations
train_size=1,
use_gpu=True)
# plot ELBO loss history during training, removing first 100 epochs from the plot
mod.plot_history(1000)
plt.legend(labels=['full data training']);
# In this section, we export the estimated cell abundance (summary of the posterior distribution).
adata_vis = mod.export_posterior(
adata_vis, sample_kwargs={'num_samples': 1000, 'batch_size': mod.adata.n_obs, 'use_gpu': True}
)
# Save model
mod.save(f"{run_name}", overwrite=True)
# mod = cell2location.models.Cell2location.load(f"{run_name}", adata_vis)
# Save anndata object with results
adata_file = f"{run_name}/sp.h5ad"
adata_vis.write(adata_file)
adata_file
#模型和輸出 h5ad 可以像下面這樣加載:
mod = cell2location.models.Cell2location.load(f"{run_name}", adata_vis)
adata_file = f"{run_name}/sp.h5ad"
adata_vis = sc.read_h5ad(adata_file)
# Examine reconstruction accuracy to assess if there are any issues with mapping
# the plot should be roughly diagonal, strong deviations will signal problems
mod.plot_QC()
當(dāng)整合多個空間批次時和當(dāng)處理載玻片中檢測到的 RNA 差異很大的數(shù)據(jù)集時(這在組織學(xué)上無法用高細(xì)胞密度來解釋) ,評估 cell2location 是否使這些影響恢復(fù)正常是很重要的仙蛉。
期望看到的是不同批次中有相似的總細(xì)胞數(shù)和不同的RNA檢測靈敏度(這都可以通過cell2location評估)笋敞。
期望組織學(xué)中的總細(xì)胞數(shù)量反映出高細(xì)胞密度。
Visualising cell abundance in spatial coordinates
# add 5% quantile, representing confident cell abundance, 'at least this amount is present',
# to adata.obs with nice names for plotting
adata_vis.obs[adata_vis.uns['mod']['factor_names']] = adata_vis.obsm['q05_cell_abundance_w_sf']
# select one slide
from cell2location.utils import select_slide
slide = select_slide(adata_vis, 'V1_Human_Lymph_Node')
# plot in spatial coordinates
with mpl.rc_context({'axes.facecolor': 'black','figure.figsize': [4.5, 5]}):
sc.pl.spatial(slide, cmap='magma',
# show first 8 cell types
color=['B_Cycling', 'B_GC_LZ', 'T_CD4+_TfH_GC', 'FDC','B_naive', 'T_CD4+_naive', 'B_plasma', 'Endo'],
ncols=4, size=1.3,
img_key='hires',
# limit color scale at 99.2% quantile of cell abundance
vmin=0, vmax='p99.2',
save=True
)
# Now we use cell2location plotter that allows showing multiple cell types in one panel
from cell2location.plt import plot_spatial
# select up to 6 clusters
clust_labels = ['T_CD4+_naive', 'B_naive', 'FDC']
clust_col = ['' + str(i) for i in clust_labels] # in case column names differ from labels
slide = select_slide(adata_vis, 'V1_Human_Lymph_Node')
import matplotlib.pyplot as plt
with mpl.rc_context({'figure.figsize': (15, 15)}):
fig = plot_spatial(
adata=slide,
# labels to show on a plot
color=clust_col, labels=clust_labels,
show_img=True,
# 'fast' (white background) or 'dark_background'
style='fast',
# limit color scale at 99.2% quantile of cell abundance
max_color_quantile=0.992,
# size of locations (adjust depending on figure size)
circle_diameter=6,
colorbar_position='right'
)
plt.savefig('showing_multiple_cell_types_in_one_panel.pdf')
下游分析
基于 Leiden 聚類的離散組織區(qū)域識別
我們通過使用由cell2location估計的細(xì)胞豐度來確定細(xì)胞組成不同的組織區(qū)域荠瘪。
我們通過估計每種細(xì)胞類型的細(xì)胞豐度來聚集 Visium 點夯巷,從而找到組織區(qū)域。
首先構(gòu)造了一個 k 鄰近的 neigbour (KNN)圖表示細(xì)胞豐度估計值中位置的相似性哀墓,然后應(yīng)用 Leiden 聚類算法進行聚類鞭莽。鄰近區(qū)域的數(shù)量應(yīng)適應(yīng)數(shù)據(jù)集的大小和解剖學(xué)定義區(qū)域的大小(例如海馬區(qū)域相當(dāng)小,因此可能被大n_neighbors
所掩蓋)麸祷。這可以做一個范圍的 KNN 近鄰和Leiden聚類分辨率澎怒,直到一個聚類匹配的組織解剖結(jié)構(gòu)獲得。
聚類是跨所有 Visium 部分/批次共同完成的阶牍,因此區(qū)域身份是直接可比的喷面。當(dāng)多個批次之間存在強烈的技術(shù)效應(yīng)時(這里不是這種情況) ,sc.external.pp.bbknn
原則上可以用來解釋 KNN 構(gòu)建過程中的這些效應(yīng)走孽。
生成的集群保存在 adata _ vis. obs [‘ region _ cluster’]
中
# compute KNN using the cell2location output stored in adata.obsm
sc.pp.neighbors(adata_vis, use_rep='q05_cell_abundance_w_sf',
n_neighbors = 15)
# Cluster spots into regions using scanpy
sc.tl.leiden(adata_vis, resolution=1.1)
# add region as categorical variable
adata_vis.obs["region_cluster"] = adata_vis.obs["leiden"].astype("category")
# compute UMAP using KNN graph based on the cell2location output
sc.tl.umap(adata_vis, min_dist = 0.3, spread = 1)
# show regions in UMAP coordinates
with mpl.rc_context({'axes.facecolor': 'white',
'figure.figsize': [8, 8]}):
sc.pl.umap(adata_vis, color=['region_cluster'], size=30,
color_map = 'RdPu', ncols = 2, legend_loc='on data',
legend_fontsize=20)
sc.pl.umap(adata_vis, color=['sample'], size=30,
color_map = 'RdPu', ncols = 2,
legend_fontsize=20)
# plot in spatial coordinates
with mpl.rc_context({'axes.facecolor': 'black',
'figure.figsize': [4.5, 5]}):
sc.pl.spatial(adata_vis, color=['region_cluster'],
size=1.3, img_key='hires', alpha=0.5)
利用基質(zhì)因子分解(NMF)識別細(xì)胞區(qū)間/組織區(qū)域
在這里惧辈,我們使用 cell2location 映射結(jié)果來確定細(xì)胞類型的空間共存,以便更好地理解組織組織和預(yù)測細(xì)胞相互作用磕瓷。我們用cell2location進行了細(xì)胞類型豐度的非負(fù)矩陣分解估計盒齿。與將 NMF 應(yīng)用于常規(guī) scRNA-seq 的有點相似念逞,加性 NMF 分解產(chǎn)生了一組空間細(xì)胞類型豐度分布圖,這些分布圖可以捕捉共同定位的細(xì)胞類型边翁。這種基于NMF 的分解自然地解釋了多種細(xì)胞類型和微環(huán)境可以在相同的 Visium 位置共存的事實 翎承,同時跨組織區(qū)域(例如個體生發(fā)中心)共享信息。
提示在實際應(yīng)用中符匾,最好對一系列因子 R = 5叨咖,..,30進行 NMF 訓(xùn)練啊胶。并選擇 R作為捕獲細(xì)小組織區(qū)域和分裂已知部分之間的平衡甸各。如果你想找到一些最獨特的細(xì)胞區(qū)間,使用小一點的R焰坪。如果你想找到非常強的同位信號趣倾,并假設(shè)大多數(shù)細(xì)胞類型不同位,使用大一點的R某饰。
下面我們將展示如何執(zhí)行此分析誊酌。為了幫助這個分析,我們把這個分析包裝成一個管道露乏,這個管道可以自動訓(xùn)練不同R的 NMF 模型:
from cell2location import run_colocation
res_dict, adata_vis = run_colocation(
adata_vis,
model_name='CoLocatedGroupsSklearnNMF',
train_args={
'n_fact': np.arange(11, 13), # IMPORTANT: use a wider range of the number of factors (5-30)
'sample_name_col': 'sample', # columns in adata_vis.obs that identifies sample
'n_restarts': 3 # number of training restarts
},
export_args={'path': f'{run_name}/CoLocatedComb/'}
)
對于每個factor number碧浊,模型生成以下文件夾輸出列表:
cell_type_fractions_heatmap/
:a dot plot of the estimated NMF weights of cell types (rows) across NMF components (columns)cell_type_fractions_mean/
:the data used for dot plotfactor_markers/
: tables listing top 10 cell types most speficic to each NMF factormodels/
:saved NMF modelspredictive_accuracy/
:2D histogram plot showing how well NMF explains cell2location outputspatial/
:NMF weights across locatinos in spatial coordinateslocation_factors_mean/
:the data used for the plot in spatial coordiantesstability_plots/
:stability of NMF weights between training restarts要檢查的關(guān)鍵輸出是
cell _ type _ fraction _ heatmap/
中的文件,其中顯示了 NMF 組件(列)中對應(yīng)于細(xì)胞間隔的各種類型(行)的估計 NMF 權(quán)重的點圖瘟仿。顯示的是相對權(quán)重箱锐,每個單元類型的組件之間的規(guī)范化。
# Here we plot the NMF weights (Same as saved to `cell_type_fractions_heatmap`)
res_dict['n_fact12']['mod'].plot_cell_type_loadings()
高級應(yīng)用
估計空間數(shù)據(jù)中每個基因的細(xì)胞類型特異性表達
為此劳较,我們采用了由 Cable 等人提出的估計條件期望表達式的方法驹止。使用 cell2location,我們可以查看后驗概率分布观蜗,而不僅僅是細(xì)胞型特定表達式的估計值
# Compute expected expression per cell type
expected_dict = mod.module.model.compute_expected_per_cell_type(
mod.samples["post_sample_q05"], mod.adata
)
# Add to anndata layers
for i, n in enumerate(mod.factor_names_):
adata_vis.layers[n] = expected_dict['mu'][i]
# Save anndata object with results
adata_file = f"{run_name}/sp.h5ad"
adata_vis.write(adata_file)
adata_file
# Look at cell type specific expression in spatial coordinates,
# Here we highlight CD3D, pan T-cell marker expressed by
# 2 subtypes of T cells in distinct locations but not expressed by co-located B cells
ctypes = ['T_CD4+_TfH_GC', 'T_CD4+_naive', 'B_GC_LZ']
genes = ['CD3D', 'CR2']
with mpl.rc_context({'axes.facecolor': 'black'}):
# select one slide
slide = select_slide(adata_vis, 'V1_Human_Lymph_Node')
from tutorial_utils import plot_genes_per_cell_type
plot_genes_per_cell_type(slide, genes, ctypes);
與后驗概率計算一起臊恋,計算任意分位數(shù)
# Get posterior distribution samples for specific variables
samples_w_sf = mod.sample_posterior(num_samples=1000, use_gpu=True, return_samples=True,
batch_size=2020,
return_sites=['w_sf', 'm_g', 'u_sf_mRNA_factors'])
# samples_w_sf['posterior_samples'] contains 1000 samples as arrays with dim=(num_samples, ...)
samples_w_sf['posterior_samples']['w_sf'].shape
# Compute any quantile of the posterior distribution
medians = mod.posterior_quantile(q=0.5, use_gpu = True)
with mpl.rc_context({'axes.facecolor': 'white',
'figure.figsize': [5, 5]}):
plt.scatter(medians['w_sf'].flatten(), mod.samples['post_sample_means']['w_sf'].flatten());
plt.xlabel('median');
plt.ylabel('mean');
參考:https://cell2location.readthedocs.io/en/latest/notebooks/cell2location_tutorial.html