到了年根了胡本,其實想總結(jié)一些之前的東西,其中cell2location很想再次分享一下兢哭,好好研讀一下庶香。
Cell2location 是一種有principled Bayesian model,可以解析空間轉(zhuǎn)錄組數(shù)據(jù)中的 fine-grained細胞類型泛源,并創(chuàng)建不同組織的綜合細胞圖拔妥。
Given cell type annotation for each cell, the corresponding reference cell type signatures (g_{f,g}), which represent the average mRNA count of each gene (g) in each cell type (f), can be estimated from sc/snRNA-seq data using 2 provided methods (see below). Cell2location needs untransformed unnormalised spatial mRNA counts as input. You also need to provide cell2location with the expected average cell abundance per location which is used as a prior to guide estimation of absolute cell abundance. This value depends on the tissue and can be estimated by counting nuclei for a few locations in the paired histology image but can be approximate (see paper methods for more guidance).
provide 2 methods for estimating reference cell type signatures from scRNA-seq data:
- 一種基于負二項式回歸的統(tǒng)計方法。 通常建議使用 NB 回歸达箍,它允許跨技術(shù)和批次穩(wěn)健地組合數(shù)據(jù)没龙,從而提高空間映射精度。
- 單個基因的每個cluster平均 mRNA 計數(shù)的硬編碼計算 (cell2location.cluster_averages.compute_cluster_averages)缎玫。 當(dāng)批次效應(yīng)較小時硬纤,這種更快的硬編碼計算每個集群平均值的方法提供了類似的高準(zhǔn)確度。
代碼部分也很值得多多學(xué)習(xí)
Loading packages
import sys
#if branch is stable, will install via pypi, else will install from source
branch = "github"
user = "BayraktarLab"
IN_COLAB = "google.colab" in sys.modules
if IN_COLAB and branch == "stable":
!pip install --quiet cell2location
elif IN_COLAB and branch != "stable":
!pip install --quiet --upgrade jsonschema
!pip install --quiet git+https://github.com/BayraktarLab/cell2location#egg=cell2location[tutorials]
import scanpy as sc
import anndata
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
import cell2location
import scvi
from matplotlib import rcParams
rcParams['pdf.fonttype'] = 42 # enables correct plotting of text
import seaborn as sns
First, let’s define where we save the results of our analysis:
results_folder = './results/lymph_nodes_analysis/'
# create paths and names to results folders for reference regression and cell2location models
ref_run_name = f'{results_folder}/reference_signatures'
run_name = f'{results_folder}/cell2location_map'
- 這里的用法我很喜歡赃磨, f'{results_folder}/reference_signatures'
Loading Visium and scRNA-seq reference data
首先從 10X Space Ranger 輸出中讀取空間 Visium 數(shù)據(jù)筝家。 可以使用 scanpy 方便地下載和導(dǎo)入此數(shù)據(jù)集。
adata_vis = sc.datasets.visium_sge(sample_id="V1_Human_Lymph_Node")
adata_vis.obs['sample'] = list(adata_vis.uns['spatial'].keys())[0]
# rename genes to ENSEMBL
adata_vis.var['SYMBOL'] = adata_vis.var_names
adata_vis.var_names = adata_vis.var['gene_ids']
adata_vis.var_names.name = None
- 這里對scanpy讀入的對象的信息添加很值得學(xué)習(xí)邻辉。
注意溪王! 線粒體編碼的基因(基因名稱以前綴 mt- 或 MT- 開頭)與空間映射無關(guān),因為它們的表達代表了單細胞和細胞核數(shù)據(jù)中的技術(shù)產(chǎn)物值骇,而不是線粒體的生物學(xué)豐度莹菱。 然而,這些基因在每個位置構(gòu)成了 15-40% 的 mRNA吱瘩。 因此道伟,為了避免映射偽影,我們強烈建議去除線粒體基因使碾。
# find mitochondria-encoded (MT) genes
adata_vis.var['MT_gene'] = [gene.startswith('MT-') for gene in adata_vis.var['SYMBOL']]
# remove MT genes for spatial mapping (keeping their counts in the object)
adata_vis.obsm['MT'] = adata_vis[:, adata_vis.var['MT_gene'].values].X.toarray()
adata_vis = adata_vis[:, ~adata_vis.var['MT_gene'].values]
在單細胞參考中包含了跨越淋巴結(jié)蜜徽、脾臟和扁桃體的 scRNA-seq 數(shù)據(jù)集裹芝,以確保我們捕獲了空間轉(zhuǎn)錄組數(shù)據(jù)集中可能存在的免疫細胞狀態(tài)的全部多樣性。
# Download data if not already here
import os
if not os.path.exists('./data/sc.h5ad'):
!cd ./data/ && wget https://cell2location.cog.sanger.ac.uk/paper/integrated_lymphoid_organ_scrna/RegressionNBV4Torch_57covariates_73260cells_10237genes/sc.h5ad
# Read data
adata_ref = sc.read(f'./data/sc.h5ad')
# Use ENSEMBL as gene IDs to make sure IDs are unique and correctly matched
adata_ref.var['SYMBOL'] = adata_ref.var.index
adata_ref.var.index = adata_ref.var['GeneID-2'].copy()
adata_ref.var_names = adata_ref.var['GeneID-2'].copy()
adata_ref.var.index.name = None
adata_ref.raw.var['SYMBOL'] = adata_ref.raw.var.index
adata_ref.raw.var.index = adata_ref.raw.var['GeneID-2'].copy()
adata_ref.raw.var.index.name = None
# before we estimate the reference cell type signature we recommend to perform very permissive genes selection
# in this 2D histogram orange rectangle lays over excluded genes.
# In this case, the downloaded dataset was already filtered using this method,
# hence no density under the orange rectangle
from cell2location.utils.filtering import filter_genes
selected = filter_genes(adata_ref, cell_count_cutoff=5, cell_percentage_cutoff2=0.03, nonz_mean_cutoff=1.12)
# filter the object
adata_ref = adata_ref[:, selected].copy()
Estimation of reference cell type signatures (NB regression)
The signatures are estimated from scRNA-seq data, accounting for batch effect, using a Negative binomial regression model.(這里需要考慮批次效應(yīng))
# prepare anndata for the regression model
scvi.data.setup_anndata(adata=adata_ref,
# 10X reaction / sample / batch
batch_key='Sample',
# cell type, covariate used for constructing signatures
labels_key='Subset',
# multiplicative technical effects (platform, 3' vs 5', donor effect)
categorical_covariate_keys=['Method']
)
scvi.data.view_anndata_setup(adata_ref)
# create and train the regression model
from cell2location.models import RegressionModel
mod = RegressionModel(adata_ref)
# Use all data for training (validation not implemented yet, train_size=1)
mod.train(max_epochs=250, batch_size=2500, train_size=1, lr=0.002, use_gpu=True)
# plot ELBO loss history during training, removing first 20 epochs from the plot
mod.plot_history(20)
# In this section, we export the estimated cell abundance (summary of the posterior distribution).
adata_ref = mod.export_posterior(
adata_ref, sample_kwargs={'num_samples': 1000, 'batch_size': 2500, 'use_gpu': True}
)
# Save model
mod.save(f"{ref_run_name}", overwrite=True)
# Save anndata object with results
adata_file = f"{ref_run_name}/sc.h5ad"
adata_ref.write(adata_file)
Examine QC plots
- 1娜汁、重建準(zhǔn)確性以評估推理是否存在任何問題。
- 2兄朋、由于批次效應(yīng)掐禁,估計的表達特征不同于每個cluster中的平均表達。 對于不受批效應(yīng)影響的 scRNA-seq 數(shù)據(jù)集(該數(shù)據(jù)集有)颅和,可以使用聚類平均表達而不是用模型估計特征傅事。 當(dāng)此圖與對角線圖非常不同時(例如,Y 軸上的值非常低峡扩,到處都是密度)蹭越,則表明特征估計存在問題。
mod.plot_QC()
The model and output h5ad can be loaded later like this:
mod = cell2location.models.RegressionModel.load(f"{ref_run_name}", adata_ref)
adata_file = f"{ref_run_name}/sc.h5ad"
adata_ref = sc.read_h5ad(adata_file)
# export estimated expression in each cluster
if 'means_per_cluster_mu_fg' in adata_ref.varm.keys():
inf_aver = adata_ref.varm['means_per_cluster_mu_fg'][[f'means_per_cluster_mu_fg_{i}'
for i in adata_ref.uns['mod']['factor_names']]].copy()
else:
inf_aver = adata_ref.var[[f'means_per_cluster_mu_fg_{i}'
for i in adata_ref.uns['mod']['factor_names']]].copy()
inf_aver.columns = adata_ref.uns['mod']['factor_names']
inf_aver.iloc[0:5, 0:5]
Cell2location: spatial mapping
# find shared genes and subset both anndata and reference signatures
intersect = np.intersect1d(adata_vis.var_names, inf_aver.index)
adata_vis = adata_vis[:, intersect].copy()
inf_aver = inf_aver.loc[intersect, :].copy()
# prepare anndata for cell2location model
scvi.data.setup_anndata(adata=adata_vis, batch_key="sample")
scvi.data.view_anndata_setup(adata_vis)
Note! While you can often use the default value of detection_alpha hyperparameter, it is useful to adapt the expected cell abundance N_cells_per_location to every tissue. This value can be estimated from paired histology images and as described in the note above. Change the value presented in this tutorial (N_cells_per_location=30) to the value observed in your your tissue.
# create and train the model
mod = cell2location.models.Cell2location(
adata_vis, cell_state_df=inf_aver,
# the expected average cell abundance: tissue-dependent
# hyper-prior which can be estimated from paired histology:
N_cells_per_location=30,
# hyperparameter controlling normalisation of
# within-experiment variation in RNA detection (using default here):
detection_alpha=200
)
mod.train(max_epochs=30000,
# train using full data (batch_size=None)
batch_size=None,
# use all data points in training because
# we need to estimate cell abundance at all locations
train_size=1,
use_gpu=True)
# plot ELBO loss history during training, removing first 100 epochs from the plot
mod.plot_history(1000)
plt.legend(labels=['full data training']);
# In this section, we export the estimated cell abundance (summary of the posterior distribution).
adata_vis = mod.export_posterior(
adata_vis, sample_kwargs={'num_samples': 1000, 'batch_size': mod.adata.n_obs, 'use_gpu': True}
)
# Save model
mod.save(f"{run_name}", overwrite=True)
# mod = cell2location.models.Cell2location.load(f"{run_name}", adata_vis)
# Save anndata object with results
adata_file = f"{run_name}/sp.h5ad"
adata_vis.write(adata_file)
Visualising cell abundance in spatial coordinates
# add 5% quantile, representing confident cell abundance, 'at least this amount is present',
# to adata.obs with nice names for plotting
adata_vis.obs[adata_vis.uns['mod']['factor_names']] = adata_vis.obsm['q05_cell_abundance_w_sf']
# select one slide
from cell2location.utils import select_slide
slide = select_slide(adata_vis, 'V1_Human_Lymph_Node')
# plot in spatial coordinates
with mpl.rc_context({'axes.facecolor': 'black',
'figure.figsize': [4.5, 5]}):
sc.pl.spatial(slide, cmap='magma',
# show first 8 cell types
color=['B_Cycling', 'B_GC_LZ', 'T_CD4+_TfH_GC', 'FDC',
'B_naive', 'T_CD4+_naive', 'B_plasma', 'Endo'],
ncols=4, size=1.3,
img_key='hires',
# limit color scale at 99.2% quantile of cell abundance
vmin=0, vmax='p99.2'
)
# Now we use cell2location plotter that allows showing multiple cell types in one panel
from cell2location.plt import plot_spatial
# select up to 6 clusters
clust_labels = ['T_CD4+_naive', 'B_naive', 'FDC']
clust_col = ['' + str(i) for i in clust_labels] # in case column names differ from labels
slide = select_slide(adata_vis, 'V1_Human_Lymph_Node')
with mpl.rc_context({'figure.figsize': (15, 15)}):
fig = plot_spatial(
adata=slide,
# labels to show on a plot
color=clust_col, labels=clust_labels,
show_img=True,
# 'fast' (white background) or 'dark_background'
style='fast',
# limit color scale at 99.2% quantile of cell abundance
max_color_quantile=0.992,
# size of locations (adjust depending on figure size)
circle_diameter=6,
colorbar_position='right'
)
Downstream analysis
Identifying discrete tissue regions by Leiden clustering
通過使用由 cell2location 估計的細胞豐度對位置進行聚類來識別細胞組成不同的組織區(qū)域骂维。
我們通過使用每種細胞類型的估計細胞豐度對 Visium 點進行聚類來找到組織區(qū)域摇锋。我們構(gòu)建了一個 K-nearest neigbour (KNN) 圖奋献,表示估計細胞豐度中位置的相似性,然后應(yīng)用 Leiden 聚類买置。 KNN 鄰居的數(shù)量應(yīng)適應(yīng)數(shù)據(jù)集的大小和解剖學(xué)定義區(qū)域的大小(即海馬區(qū)域相當(dāng)小强霎,因此可能被大型 n_neighbors 掩蓋)忿项。這可以針對范圍 KNN 鄰居和 Leiden 聚類分辨率完成,直到獲得與組織解剖結(jié)構(gòu)匹配的聚類城舞。
聚類是在所有 Visium 部分/批次中聯(lián)合完成的轩触,因此區(qū)域身份是直接可比的。當(dāng)多個批次之間存在很強的技術(shù)影響時(這里不是這種情況)家夺,原則上可以使用 sc.external.pp.bbknn 來解釋 KNN 構(gòu)建過程中的這些影響脱柱。
The resulting clusters are saved in adata_vis.obs['region_cluster'].
# compute KNN using the cell2location output stored in adata.obsm
sc.pp.neighbors(adata_vis, use_rep='q05_cell_abundance_w_sf',
n_neighbors = 15)
# Cluster spots into regions using scanpy
sc.tl.leiden(adata_vis, resolution=1.1)
# add region as categorical variable
adata_vis.obs["region_cluster"] = adata_vis.obs["leiden"].astype("category")
# compute UMAP using KNN graph based on the cell2location output
sc.tl.umap(adata_vis, min_dist = 0.3, spread = 1)
# show regions in UMAP coordinates
with mpl.rc_context({'axes.facecolor': 'white',
'figure.figsize': [8, 8]}):
sc.pl.umap(adata_vis, color=['region_cluster'], size=30,
color_map = 'RdPu', ncols = 2, legend_loc='on data',
legend_fontsize=20)
sc.pl.umap(adata_vis, color=['sample'], size=30,
color_map = 'RdPu', ncols = 2,
legend_fontsize=20)
# plot in spatial coordinates
with mpl.rc_context({'axes.facecolor': 'black',
'figure.figsize': [4.5, 5]}):
sc.pl.spatial(adata_vis, color=['region_cluster'],
size=1.3, img_key='hires', alpha=0.5)
Identifying cellular compartments / tissue zones using matrix factorisation (NMF)(這部分應(yīng)該是新的內(nèi)容)
在這里,我們使用 cell2location 映射結(jié)果來識別細胞類型的空間共現(xiàn)拉馋,以便更好地了解組織組織并預(yù)測細胞相互作用褐捻。 我們對來自 cell2location 的細胞類型豐度估計進行了非負矩陣分解(NMF)。 與將 NMF 應(yīng)用于傳統(tǒng) scRNA-seq 的既定好處類似椅邓,附加 NMF 分解產(chǎn)生了一組空間細胞類型豐度曲線柠逞,將其分組為捕獲共定位細胞類型的組件。 這種基于 NMF 的分解自然地解釋了這樣一個事實景馁,即多種細胞類型和微環(huán)境可以在相同的 Visium 位置共存板壮,同時跨組織區(qū)域(例如單個生發(fā)中心)共享信息。
提示 在實踐中合住,最好針對一系列因子 (R={5, .., 30}) 訓(xùn)練 NMF绰精,并選擇 (R) 作為捕獲精細組織區(qū)域和拆分已知區(qū)室之間的平衡撒璧。 如果您想找到幾個最明顯的細胞隔室,請使用少量因子笨使。 如果您想找到非常強的協(xié)同定位信號并假設(shè)大多數(shù)細胞類型不協(xié)同定位卿樱,請使用很多因子(> 30 - 此處使用)。
Below we show how to perform this analysis. To aid this analysis, we wrapped the analysis shown the notebook on advanced downstream analysis into a pipeline that automates training of the NMF model with varying number of factors:
from cell2location import run_colocation
res_dict, adata_vis = run_colocation(
adata_vis,
model_name='CoLocatedGroupsSklearnNMF',
train_args={
'n_fact': np.arange(11, 13), # IMPORTANT: use a wider range of the number of factors (5-30)
'sample_name_col': 'sample', # columns in adata_vis.obs that identifies sample
'n_restarts': 3 # number of training restarts
},
export_args={'path': f'{run_name}/CoLocatedComb/'}
)
For every factor number, the model produces the following list of folder outputs:
- cell_type_fractions_heatmap/: a dot plot of the estimated NMF weights of cell types (rows) across NMF components (columns)
- cell_type_fractions_mean/: the data used for dot plot
- factor_markers/: tables listing top 10 cell types most speficic to each NMF factor
- models/: saved NMF models
- predictive_accuracy/: 2D histogram plot showing how well NMF explains cell2location output
- spatial/: NMF weights across locatinos in spatial coordinates
- location_factors_mean/: the data used for the plot in spatial coordiantes
- stability_plots/: stability of NMF weights between training restarts
檢查的關(guān)鍵輸出是 cell_type_fractions_heatmap/ 中的文件硫椰,它們顯示了與細胞隔室相對應(yīng)的 NMF 組件(列)中細胞類型(行)的估計 NMF 權(quán)重的點圖繁调。 顯示的是相對權(quán)重,對每種細胞類型的組件進行了標(biāo)準(zhǔn)化靶草。
# Here we plot the NMF weights (Same as saved to `cell_type_fractions_heatmap`)
res_dict['n_fact12']['mod'].plot_cell_type_loadings()
Advanced use
Estimate cell-type specific expression of every gene in the spatial data
For this, we adapt the approach of estimating conditional expected expression proposed by RCTD (Cable et al) method. With cell2location, we can look at the posterior distribution rather than just point estimates of cell type specific expression (see mod.samples.keys()
and next section on using full distribution).
# Compute expected expression per cell type
expected_dict = mod.module.model.compute_expected_per_cell_type(
mod.samples["post_sample_q05"], mod.adata
)
# Add to anndata layers
for i, n in enumerate(mod.factor_names_):
adata_vis.layers[n] = expected_dict['mu'][i]
# Save anndata object with results
adata_file = f"{run_name}/sp.h5ad"
adata_vis.write(adata_file)
adata_file
# Look at cell type specific expression in spatial coordinates,
# Here we highlight CD3D, pan T-cell marker expressed by
# 2 subtypes of T cells in distinct locations but not expressed by co-located B cells
ctypes = ['T_CD4+_TfH_GC', 'T_CD4+_naive', 'B_GC_LZ']
genes = ['CD3D', 'CR2']
with mpl.rc_context({'axes.facecolor': 'black'}):
# select one slide
slide = select_slide(adata_vis, 'V1_Human_Lymph_Node')
from tutorial_utils import plot_genes_per_cell_type
plot_genes_per_cell_type(slide, genes, ctypes);
就是想回顧一下蹄胰,當(dāng)然了,軟件也更新了很多內(nèi)容
生活很好奕翔,有你更好