又是周五翠订,馬上元旦,不知道大家打算怎么跨年呢遵倦?但是尽超,今天還是工作日,我們還是要學(xué)習(xí)學(xué)習(xí)梧躺,不負(fù)韶華似谁。
這一篇還是想詳細(xì)梳理一下cell2location傲绣,包括算法以及代碼的寫法,有很多學(xué)習(xí)的地方巩踏。
這里演示了如何使用 cell2location 模型將單個(gè)細(xì)胞參考細(xì)胞類型映射到空間轉(zhuǎn)錄組數(shù)據(jù)集秃诵。 在這里,使用 10X 單核 RNA 測(cè)序 (snRNAseq) 和從小鼠大腦的相鄰組織切片生成的 Visium 空間轉(zhuǎn)錄組數(shù)據(jù)(單核 + 空間)
Cell2location 是一種貝葉斯模型塞琼,它集成了單細(xì)胞 RNA-seq (scRNA-seq) 和多細(xì)胞空間轉(zhuǎn)錄組學(xué)菠净,以高效地繪制大型綜合細(xì)胞類型參考
第一部分,計(jì)算單細(xì)胞細(xì)胞類型的表達(dá)特征
第一步彪杉,從 scRNA-seq 譜中估計(jì)參考細(xì)胞類型特征毅往,例如使用常規(guī)聚類來識(shí)別細(xì)胞類型和亞群,然后估計(jì)平均cluster基因表達(dá)譜派近。 Cell2location 基于負(fù)二項(xiàng)式回歸實(shí)現(xiàn)了這個(gè)估計(jì)步驟攀唯,它允許跨技術(shù)和批次穩(wěn)健地組合數(shù)據(jù)(多個(gè)樣本還涉及到批次).
Loading packages
import sys
import scanpy as sc
import anndata
import pandas as pd
import numpy as np
import os
data_type = 'float32'
import cell2location
import matplotlib as mpl
from matplotlib import rcParams
import matplotlib.pyplot as plt
import seaborn as sns
# silence scanpy that prints a lot of warnings
import warnings
warnings.filterwarnings('ignore')
Loading single cell reference data
使用小鼠大腦的配對(duì) Visium 和 snRNAseq 參考數(shù)據(jù)集(即從相鄰組織切片生成)。 該數(shù)據(jù)集由來自 2 只小鼠的 3 個(gè)切片的細(xì)胞組成构哺。 已經(jīng)注釋了多個(gè)大腦區(qū)域的 59 個(gè)細(xì)胞神經(jīng)元和膠質(zhì)細(xì)胞群革答,包括 10 個(gè)區(qū)域星形膠質(zhì)細(xì)胞亞型(單細(xì)胞的細(xì)胞定義必須先做好)。
sc_data_folder = './data/'
results_folder = './results/mouse_brain_snrna/'
if os.path.exists(sc_data_folder) is not True:
os.mkdir(sc_data_folder)
os.system(f'cd {sc_data_folder} && wget https://cell2location.cog.sanger.ac.uk/tutorial/mouse_brain_snrna/all_cells_20200625.h5ad')
os.system(f'cd {sc_data_folder} && wget https://cell2location.cog.sanger.ac.uk/tutorial/mouse_brain_snrna/snRNA_annotation_astro_subtypes_refined59_20200823.csv')
if os.path.exists(results_folder) is not True:
os.mkdir('./results')
os.mkdir(results_folder)
讀取單細(xì)胞數(shù)據(jù)和細(xì)胞注釋結(jié)果
## snRNA reference (raw counts)
adata_snrna_raw = anndata.read_h5ad(sc_data_folder + "all_cells_20200625.h5ad")
## Cell type annotations
labels = pd.read_csv(sc_data_folder + 'snRNA_annotation_astro_subtypes_refined59_20200823.csv', index_col=0)
Add cell type labels as columns in adata.obs
####reindex函數(shù)曙强,這是第一個(gè)需要注意的函數(shù)
labels = labels.reindex(index=adata_snrna_raw.obs_names)
adata_snrna_raw.obs[labels.columns] = labels
adata_snrna_raw = adata_snrna_raw[~adata_snrna_raw.obs['annotation_1'].isna(), :]
Reduce the number of genes by discarding lowly expressed genes(通過丟棄低表達(dá)基因來減少基因數(shù)量)
這是使用 2 個(gè)閾值執(zhí)行的残拐,以去除盡可能多的低表達(dá)基因,同時(shí)避免容易刪除稀有種群標(biāo)記的高度可變基因選擇 (HVG):
- 包括至少 3% 的細(xì)胞表達(dá)的所有基因 (cell_count_cutoff2)
- 包括由至少 0.05% 的細(xì)胞表達(dá)的基因 (cell_count_cutoff)碟嘴,當(dāng)它們?cè)诜橇慵?xì)胞 (nonz_mean_cutoff) 中具有高計(jì)數(shù)時(shí)
偏向于第二種選擇基因的方式溪食,因?yàn)榈?2 步允許保留由稀有細(xì)胞群表達(dá)但水平很高的基因,而標(biāo)準(zhǔn)的 HVG 選擇方法可以過濾掉這些基因娜扇,因?yàn)樗鼈兊娜志岛头讲钶^低错沃。
# remove cells and genes with 0 counts everywhere
sc.pp.filter_cells(adata_snrna_raw, min_genes=1)
sc.pp.filter_genes(adata_snrna_raw, min_cells=1)
# calculate the mean of each gene across non-zero cells
adata_snrna_raw.var['n_cells'] = (adata_snrna_raw.X.toarray() > 0).sum(0)
adata_snrna_raw.var['nonz_mean'] = adata_snrna_raw.X.toarray().sum(0) / adata_snrna_raw.var['n_cells']
plt.hist2d(np.log10(adata_snrna_raw.var['nonz_mean']),
np.log10(adata_snrna_raw.var['n_cells']), bins=100,
norm=mpl.colors.LogNorm(),
range=[[0,0.5], [1,4.5]]);
nonz_mean_cutoff = np.log10(1.12) # cut off for expression in non-zero cells
cell_count_cutoff = np.log10(adata_snrna_raw.shape[0] * 0.0005) # cut off percentage for cells with higher expression
cell_count_cutoff2 = np.log10(adata_snrna_raw.shape[0] * 0.03)# cut off percentage for cells with small expression
plt.vlines(nonz_mean_cutoff, cell_count_cutoff, cell_count_cutoff2, color = 'orange');
plt.hlines(cell_count_cutoff, nonz_mean_cutoff, 1, color = 'orange');
plt.hlines(cell_count_cutoff2, 0, nonz_mean_cutoff, color = 'orange');
plt.xlabel('Mean count in cells with mRNA count > 0 (log10)');
plt.ylabel('Count of cells with mRNA count > 0 (log10)');
Show the number of selected cells and genes:
adata_snrna_raw[:,(np.array(np.log10(adata_snrna_raw.var['nonz_mean']) > nonz_mean_cutoff)
| np.array(np.log10(adata_snrna_raw.var['n_cells']) > cell_count_cutoff2))
& np.array(np.log10(adata_snrna_raw.var['n_cells']) > cell_count_cutoff)].shape
###(40532, 12844)
Filter the object
# select genes based on mean expression in non-zero cells
adata_snrna_raw = adata_snrna_raw[:,(np.array(np.log10(adata_snrna_raw.var['nonz_mean']) > nonz_mean_cutoff)
| np.array(np.log10(adata_snrna_raw.var['n_cells']) > cell_count_cutoff2))
& np.array(np.log10(adata_snrna_raw.var['n_cells']) > cell_count_cutoff)
& np.array(~adata_snrna_raw.var['SYMBOL'].isna())]
Add counts matrix as adata.raw
adata_snrna_raw.raw = adata_snrna_raw
Show UMAP of cells(圖形展示)
可以通過使用標(biāo)準(zhǔn)的 scanpy 工作流程來檢查數(shù)據(jù)的細(xì)胞組成,以生成單個(gè)細(xì)胞數(shù)據(jù)的 UMAP 表示雀瓢。
這個(gè)地方要注意一點(diǎn)枢析,變異度最大的第一個(gè)PC軸被去掉了;去除批次效應(yīng)采用的是bbknn
#########################
adata_snrna_raw.X = adata_snrna_raw.raw.X.copy()
sc.pp.log1p(adata_snrna_raw)
sc.pp.scale(adata_snrna_raw, max_value=10)
sc.tl.pca(adata_snrna_raw, svd_solver='arpack', n_comps=80, use_highly_variable=False)
# Plot total counts over PC to check whether PC is indeed associated with total counts
#sc.pl.pca_variance_ratio(adata_snrna_raw, log=True)
#sc.pl.pca(adata_snrna_raw, color=['total_counts'],
# components=['0,1', '2,3', '4,5', '6,7', '8,9', '10,11', '12,13'],
# color_map = 'RdPu', ncols = 3, legend_loc='on data',
# legend_fontsize=10, gene_symbols='SYMBOL')
# remove the first PC which explains large amount of variance in total UMI count (likely technical variation)
adata_snrna_raw.obsm['X_pca'] = adata_snrna_raw.obsm['X_pca'][:, 1:]
adata_snrna_raw.varm['PCs'] = adata_snrna_raw.varm['PCs'][:, 1:]
#########################
# Here BBKNN (https://github.com/Teichlab/bbknn) is used to align batches (10X experiments)
import bbknn
bbknn.bbknn(adata_snrna_raw, neighbors_within_batch = 3, batch_key = 'sample', n_pcs = 79)
sc.tl.umap(adata_snrna_raw, min_dist = 0.8, spread = 1.5)
#########################
adata_snrna_raw = adata_snrna_raw[adata_snrna_raw.obs['annotation_1'].argsort(),:]
with mpl.rc_context({'figure.figsize': [10, 10],
'axes.facecolor': 'white'}):
sc.pl.umap(adata_snrna_raw, color=['annotation_1'], size=15,
color_map = 'RdPu', ncols = 1, legend_loc='on data',
legend_fontsize=10)
Estimating expression signatures
模型的簡(jiǎn)單介紹
Model-based estimation of reference expression signatures of cell types :math:g_{f,g}
using a regularised Negative Binomial regression. This model robustly derives reference expression signatures of cell types (g_{f,g}) using the data composed of multiple batches (e={1..E}) and technologies (t={1..T}). Adapting the assumptions of a range of computational methods for scRNA-seq, we assume that the expression count matrix follows a Negative Binomial distribution with unobserved expression levels (rates) (\mu_{c,g}) and a gene-specific over-dispersion (\alpha_g). We model (\mu_{c,g}) as a linear function of reference cell type signatures and technical effects: - (e_e) denotes a multiplicative global scaling parameter between experiments/batches (e) (e.g. differences in sequencing depth); - (t_{t,g}) accounts for multiplicative gene-specific difference in sensitivity between technologies; - (b_{e,g}) accounts for additive background shift of each gene in each experiment (e) (proxy for free-floating RNA).
Training the model(訓(xùn)練模型)
這里展示如何執(zhí)行包裝到單個(gè)管道函數(shù)調(diào)用中的此模型的訓(xùn)練刃麸,如何評(píng)估此模型的質(zhì)量并提取細(xì)胞類型的參考簽名以與 cell2location 一起使用: (這些參數(shù)很值得深入探討一下)
# Run the pipeline:
from cell2location import run_regression
r, adata_snrna_raw = run_regression(adata_snrna_raw, # input data object]
verbose=True, return_all=True,
train_args={
'covariate_col_names': ['annotation_1'], # column listing cell type annotation
'sample_name_col': 'sample', # column listing sample ID for each cell
# column listing technology, e.g. 3' vs 5',
# when integrating multiple single cell technologies corresponding
# model is automatically selected
'tech_name_col': None,
'stratify_cv': 'annotation_1', # stratify cross-validation by cell type annotation
'n_epochs': 100, 'minibatch_size': 1024, 'learning_rate': 0.01,
'use_cuda': True, # use GPU?
'train_proportion': 0.9, # proportion of cells in the training set (for cross-validation)
'l2_weight': True, # uses defaults for the model
'readable_var_name_col': 'SYMBOL', 'use_raw': True},
model_kwargs={}, # keep defaults
posterior_args={}, # keep defaults
export_args={'path': results_folder + 'regression_model/', # where to save results
'save_model': True, # save pytorch model?
'run_name_suffix': ''})
reg_mod = r['mod']
Saved anndata object and the trained model object can be read later using
reg_mod_name = 'RegressionGeneBackgroundCoverageTorch_65covariates_40532cells_12819genes'
reg_path = f'{results_folder}regression_model/{reg_mod_name}/'
## snRNAseq reference (raw counts)
adata_snrna_raw = sc.read(f'{reg_path}sc.h5ad')
## model
r = pickle.load(file = open(f'{reg_path}model_.p', "rb"))
reg_mod = r['mod']
Export reference expression signatures of cell types(導(dǎo)出細(xì)胞類型的參考表達(dá)特征),re的用法我們也要好好學(xué)習(xí)一下
# Export cell type expression signatures:
covariate_col_names = 'annotation_1'
inf_aver = adata_snrna_raw.raw.var.copy()
inf_aver = inf_aver.loc[:, [f'means_cov_effect_{covariate_col_names}_{i}' for i in adata_snrna_raw.obs[covariate_col_names].unique()]]
from re import sub
inf_aver.columns = [sub(f'means_cov_effect_{covariate_col_names}_{i}', '', i) for i in adata_snrna_raw.obs[covariate_col_names].unique()]
inf_aver = inf_aver.iloc[:, inf_aver.columns.argsort()]
# scale up by average sample scaling factor
inf_aver = inf_aver * adata_snrna_raw.uns['regression_mod']['post_sample_means']['sample_scaling'].mean()
將估計(jì)的特征(y 軸)與分析計(jì)算的平均表達(dá)(x 軸)進(jìn)行比較:
# compute mean expression of each gene in each cluster
aver = cell2location.cluster_averages.cluster_averages.get_cluster_averages(adata_snrna_raw, covariate_col_names)
aver = aver.loc[adata_snrna_raw.var_names, inf_aver.columns]
# compare estimated signatures (y-axis) to analytically computed mean expression (x-axis)
with mpl.rc_context({'figure.figsize': [5, 5]}):
plt.hist2d(np.log10(aver.values.flatten()+1), np.log10(inf_aver.values.flatten()+1),
bins=50, norm=mpl.colors.LogNorm());
plt.xlabel('Mean expression in each cluster');
plt.ylabel('Inferred expression in each cluster');
評(píng)估估計(jì)的特征是否因?yàn)榛煜龢颖颈尘耙驯灰瞥档拖嚓P(guān)性:
# Look at how correlated are the signatures obtained by computing mean expression
with mpl.rc_context({'figure.figsize': [5, 5]}):
reg_mod.align_plot_stability(aver, aver, 'cluster_average', 'cluster_average', align=False)
# Look at how correlated are the signatures inferred by regression model - they should be less correlated than above
with mpl.rc_context({'figure.figsize': [5, 5]}):
reg_mod.align_plot_stability(inf_aver, inf_aver, 'inferred_signature', 'inferred_signature', align=False)
將每個(gè)實(shí)驗(yàn)的細(xì)胞計(jì)數(shù)與估計(jì)的背景(湯醒叁、自由漂浮的 RNA)進(jìn)行比較:
# Examine how many mRNA per cell on average are background
sample_name_col = 'sample'
cell_count = adata_snrna_raw.obs[sample_name_col].value_counts()
cell_count.index = [f'means_sample_effect{sample_name_col}_{i}' for i in cell_count.index]
soup_amount = reg_mod.sample_effects.sum(0)
with mpl.rc_context({'figure.figsize': [5, 5]}):
plt.scatter(cell_count[soup_amount.index].values.flatten(),
soup_amount.values.flatten());
plt.xlabel('Cell count per sample'); # fraction of reads in cells
plt.ylabel('Inferred sum of sample effects');
Additional quality control: removing technical effects and performing standard scanpy single cell analysis workflow(去除批次效應(yīng))
這允許通過檢查從每個(gè)單個(gè)細(xì)胞中刪除這些因素是否會(huì)導(dǎo)致合并樣本/批次來確定模型是否成功地考慮了技術(shù)因素,同時(shí)在 UMAP 空間中保留了分離良好的細(xì)胞類型泊业。這里關(guān)注一下函數(shù)del把沼,del刪除的是變量,而不是數(shù)據(jù)吁伺。當(dāng)然這里的教程主要還是有一些注意的地方饮睬,remove the first PC which explains large amount of variance in total UMI count (首個(gè)PC都去掉了)。
adata_snrna_raw_cor = adata_snrna_raw.copy()
del adata_snrna_raw_cor.uns['log1p']
adata_snrna_raw_cor.X = np.array(reg_mod.normalise(adata_snrna_raw_cor.raw.X.copy()))
sc.pp.log1p(adata_snrna_raw_cor)
sc.pp.scale(adata_snrna_raw_cor, max_value=10)
# when all RNA of a given gene are additive background this results in NaN after scaling
adata_snrna_raw_cor.X[np.isnan(adata_snrna_raw_cor.X)] = 0
sc.tl.pca(adata_snrna_raw_cor, svd_solver='arpack', n_comps=80, use_highly_variable=False)
adata_snrna_raw.obs['total_counts'] = np.array(adata_snrna_raw.raw.X.sum(1)).flatten()
adata_snrna_raw_cor.obs['total_counts'] = adata_snrna_raw.obs['total_counts'].values.copy()
sc.pl.pca(adata_snrna_raw_cor, color=['total_counts'],
components=['1,2'],
color_map = 'RdPu', ncols = 2, legend_loc='on data', vmax='p99.9',
legend_fontsize=10)
# remove the first PC which explains large amount of variance in total UMI count (likely technical variation)
adata_snrna_raw_cor.obsm['X_pca'] = adata_snrna_raw_cor.obsm['X_pca'][:, 1:]
adata_snrna_raw_cor.varm['PCs'] = adata_snrna_raw_cor.varm['PCs'][:, 1:]
# here we use standard neighbors function rather than bbknn
# to show that the regression model can merge batches / experiments
sc.pp.neighbors(adata_snrna_raw_cor, n_neighbors = 15, n_pcs = 79, metric='cosine')
sc.tl.umap(adata_snrna_raw_cor, min_dist = 0.8, spread = 1)
with mpl.rc_context({'figure.figsize': [7, 7],
'axes.facecolor': 'white'}):
sc.pl.umap(adata_snrna_raw_cor, color=['annotation_1', 'sample', 'total_counts'],
color_map = 'RdPu', ncols = 1, size=13, #legend_loc='on data',
legend_fontsize=10, palette=sc.pl.palettes.default_102)
第二部分篮奄,Spatial mapping of cell types across the mouse brain (2/3) - cell2location
Loading packages and setting up GPU
首先则吟,我們需要加載相關(guān)的包并告訴cell2location使用GPU。 cell2location 是用 pymc3 語言編寫的隅俘,用于概率建模,它使用名為 theano 的深度學(xué)習(xí)庫進(jìn)行大量計(jì)算劫拗。 雖然該包適用于 GPU 和 CPU间校,但使用 GPU 可顯著縮短 10X Visium 數(shù)據(jù)集的計(jì)算時(shí)間矾克。 對(duì)于空間位置較少的較小數(shù)據(jù)集(例如 Nanostring WTA 技術(shù)),使用 CPU 更可行憔足。
import sys
import scanpy as sc
import anndata
import pandas as pd
import numpy as np
import os
import gc
# this line forces theano to use the GPU and should go before importing cell2location
os.environ["THEANO_FLAGS"] = 'device=cuda0,floatX=float32,force_device=True'
# if using the CPU uncomment this:
#os.environ["THEANO_FLAGS"] = 'device=cpu,floatX=float32,openmp=True,force_device=True'
import cell2location
import matplotlib as mpl
from matplotlib import rcParams
import matplotlib.pyplot as plt
import seaborn as sns
# silence scanpy that prints a lot of warnings
import warnings
warnings.filterwarnings('ignore')
Loading Visium data
首先胁附,需要從數(shù)據(jù)門戶下載并解壓縮空間數(shù)據(jù),以及下載參考細(xì)胞類型的注釋文件:
# Set paths to data and results used through the document:
sp_data_folder = './data/mouse_brain_visium_wo_cloupe_data/'
results_folder = './results/mouse_brain_snrna/'
regression_model_output = 'RegressionGeneBackgroundCoverageTorch_65covariates_40532cells_12819genes'
reg_path = f'{results_folder}regression_model/{regression_model_output}/'
# Download and unzip spatial data
if os.path.exists('./data') is not True:
os.mkdir('./data')
os.system('cd ./data && wget https://cell2location.cog.sanger.ac.uk/tutorial/mouse_brain_visium_wo_cloupe_data.zip')
os.system('cd ./data && unzip mouse_brain_visium_wo_cloupe_data.zip')
# Download and unzip snRNA-seq data with signatures of reference cell types
# (if the output folder was not created by tutorial 1/3)
if os.path.exists(reg_path) is not True:
os.mkdir('./results')
os.mkdir(f'{results_folder}')
os.mkdir(f'{results_folder}regression_model')
os.mkdir(f'{reg_path}')
os.system(f'cd {reg_path} && wget https://cell2location.cog.sanger.ac.uk/tutorial/mouse_brain_snrna/regression_model/RegressionGeneBackgroundCoverageTorch_65covariates_40532cells_12819genes/sc.h5ad')
現(xiàn)在滓彰,從 10X Space Ranger 輸出中讀取空間 Visium 數(shù)據(jù)并檢查幾個(gè) QC 圖控妻。 在這里,將 Visium 小鼠大腦實(shí)驗(yàn)(即切片)和相應(yīng)的組織學(xué)圖像加載到單個(gè) anndata 對(duì)象 adata 中揭绑。
定義函數(shù)的用法多學(xué)學(xué)弓候,別定義的那么難看,??
def read_and_qc(sample_name, path=sp_data_folder + 'rawdata/'):
r""" This function reads the data for one 10X spatial experiment into the anndata object.
It also calculates QC metrics. Modify this function if required by your workflow.
:param sample_name: Name of the sample
:param path: path to data
"""
adata = sc.read_visium(path + str(sample_name),
count_file='filtered_feature_bc_matrix.h5', load_images=True)
adata.obs['sample'] = sample_name
adata.var['SYMBOL'] = adata.var_names
adata.var.rename(columns={'gene_ids': 'ENSEMBL'}, inplace=True)
adata.var_names = adata.var['ENSEMBL']
adata.var.drop(columns='ENSEMBL', inplace=True)
# Calculate QC metrics
from scipy.sparse import csr_matrix
adata.X = adata.X.toarray()
sc.pp.calculate_qc_metrics(adata, inplace=True)
adata.X = csr_matrix(adata.X)
adata.var['mt'] = [gene.startswith('mt-') for gene in adata.var['SYMBOL']]
adata.obs['mt_frac'] = adata[:, adata.var['mt'].tolist()].X.sum(1).A.squeeze()/adata.obs['total_counts']
# add sample name to obs names
adata.obs["sample"] = [str(i) for i in adata.obs['sample']]
adata.obs_names = adata.obs["sample"] \
+ '_' + adata.obs_names
adata.obs.index.name = 'spot_id'
return adata
def select_slide(adata, s, s_col='sample'):
r""" This function selects the data for one slide from the spatial anndata object.
:param adata: Anndata object with multiple spatial experiments
:param s: name of selected experiment
:param s_col: column in adata.obs listing experiment name for each location
"""
slide = adata[adata.obs[s_col].isin([s]), :]
s_keys = list(slide.uns['spatial'].keys())
s_spatial = np.array(s_keys)[[s in k for k in s_keys]][0]
slide.uns['spatial'] = {s_spatial: slide.uns['spatial'][s_spatial]}
return slide
#######################
# Read the list of spatial experiments
sample_data = pd.read_csv(sp_data_folder + 'Visium_mouse.csv')
# Read the data into anndata objects
slides = []
for i in sample_data['sample_name']:
slides.append(read_and_qc(i, path=sp_data_folder + 'rawdata/'))
# Combine anndata objects together
adata = slides[0].concatenate(
slides[1:],
batch_key="sample",
uns_merge="unique",
batch_categories=sample_data['sample_name'],
index_unique=None
)
#######################
注意他匪! 線粒體編碼的基因(基因名稱以前綴 mt- 或 MT- 開頭)與空間映射無關(guān)菇存,因?yàn)樗鼈兊谋磉_(dá)代表了單細(xì)胞和細(xì)胞核數(shù)據(jù)中的技術(shù)產(chǎn)物,而不是線粒體的生物學(xué)豐度邦蜜。 然而依鸥,這些基因在每個(gè)位置構(gòu)成了 15-40% 的 mRNA。 因此悼沈,為了避免映射偽影贱迟,我們強(qiáng)烈建議去除線粒體基因。
# mitochondria-encoded (MT) genes should be removed for spatial mapping
adata.obsm['mt'] = adata[:, adata.var['mt'].values].X.toarray()
adata = adata[:, ~adata.var['mt'].values]###這個(gè)方法不錯(cuò)
Look at QC metrics
現(xiàn)在讓我們看看 QC:Visium 實(shí)驗(yàn)中每個(gè)位置的總計(jì)數(shù)和基因總數(shù)絮供。
python的enumerate函數(shù)衣吠,也是值得學(xué)習(xí)的
# PLOT QC FOR EACH SAMPLE
fig, axs = plt.subplots(len(slides), 4, figsize=(15, 4*len(slides)-4))
for i, s in enumerate(adata.obs['sample'].unique()):
#fig.suptitle('Covariates for filtering')
slide = select_slide(adata, s)
sns.distplot(slide.obs['total_counts'],
kde=False, ax = axs[i, 0])
axs[i, 0].set_xlim(0, adata.obs['total_counts'].max())
axs[i, 0].set_xlabel(f'total_counts | {s}')
sns.distplot(slide.obs['total_counts']\
[slide.obs['total_counts']<20000],
kde=False, bins=40, ax = axs[i, 1])
axs[i, 1].set_xlim(0, 20000)
axs[i, 1].set_xlabel(f'total_counts | {s}')
sns.distplot(slide.obs['n_genes_by_counts'],
kde=False, bins=60, ax = axs[i, 2])
axs[i, 2].set_xlim(0, adata.obs['n_genes_by_counts'].max())
axs[i, 2].set_xlabel(f'n_genes_by_counts | {s}')
sns.distplot(slide.obs['n_genes_by_counts']\
[slide.obs['n_genes_by_counts']<6000],
kde=False, bins=60, ax = axs[i, 3])
axs[i, 3].set_xlim(0, 6000)
axs[i, 3].set_xlabel(f'n_genes_by_counts | {s}')
plt.tight_layout()
Visualise Visium data in spatial 2D and UMAP coordinates
Visualising data in spatial coordinates with scanpy
Next, we show how to plot these QC values over the histology image using standard scanpy tools
slide = select_slide(adata, 'ST8059048')
with mpl.rc_context({'figure.figsize': [6,7],
'axes.facecolor': 'white'}):
sc.pl.spatial(slide, img_key = "hires", cmap='magma',
library_id=list(slide.uns['spatial'].keys())[0],
color=['total_counts', 'n_genes_by_counts'], size=1,
gene_symbols='SYMBOL', show=False, return_fig=True)
Here we show how to use scanpy to plot the expression of individual genes without the histology image.
with mpl.rc_context({'figure.figsize': [6,7],
'axes.facecolor': 'black'}):
sc.pl.spatial(slide,
color=["Rorb", "Vip"], img_key=None, size=1,
vmin=0, cmap='magma', vmax='p99.0',
gene_symbols='SYMBOL'
)
Add counts matrix as adata.raw
adata_vis = adata.copy()
adata_vis.raw = adata_vis
######## Select two Visium sections to speed up the analysis
選擇兩個(gè) Visium 部分,也稱為實(shí)驗(yàn)/批次壤靶,以加快分析速度缚俏,每個(gè)生物復(fù)制一個(gè)。
s = ['ST8059048', 'ST8059052']
adata_vis = adata_vis[adata_vis.obs['sample'].isin(s),:]
Construct and examine UMAP of locations
現(xiàn)在萍肆,我們將標(biāo)準(zhǔn)的掃描處理pipeline應(yīng)用于空間 Visium 數(shù)據(jù)袍榆,以顯示實(shí)驗(yàn)數(shù)據(jù)中的可變性。 重要的是塘揣,此工作流程將顯示數(shù)據(jù)中批次差異的程度.
在這個(gè)小鼠大腦數(shù)據(jù)集中包雀,切片之間只有少數(shù)區(qū)域應(yīng)該不同,因?yàn)槲覀兪褂昧藖碜陨飶?fù)制品的 2 個(gè)樣本亲铡,這些樣本在小鼠大腦中沿前后軸的位置略有不同才写。 我們從兩個(gè)實(shí)驗(yàn)和一些不匹配中看到了位置的一般對(duì)齊葡兑,這里的實(shí)驗(yàn)之間的大部分差異來自批次效應(yīng),cell2location 可以解釋這一點(diǎn)赞草。
adata_vis_plt = adata_vis.copy()
# Log-transform (log(data + 1))
sc.pp.log1p(adata_vis_plt)
# Find highly variable genes within each sample
adata_vis_plt.var['highly_variable'] = False
for s in adata_vis_plt.obs['sample'].unique():
adata_vis_plt_1 = adata_vis_plt[adata_vis_plt.obs['sample'].isin([s]), :]
sc.pp.highly_variable_genes(adata_vis_plt_1, min_mean=0.0125, max_mean=5, min_disp=0.5, n_top_genes=1000)
hvg_list = list(adata_vis_plt_1.var_names[adata_vis_plt_1.var['highly_variable']])
adata_vis_plt.var.loc[hvg_list, 'highly_variable'] = True
# Scale the data ( (data - mean) / sd )
sc.pp.scale(adata_vis_plt, max_value=10)
# PCA, KNN construction, UMAP
sc.tl.pca(adata_vis_plt, svd_solver='arpack', n_comps=40, use_highly_variable=True)
sc.pp.neighbors(adata_vis_plt, n_neighbors = 20, n_pcs = 40, metric='cosine')
sc.tl.umap(adata_vis_plt, min_dist = 0.3, spread = 1)
with mpl.rc_context({'figure.figsize': [8, 8],
'axes.facecolor': 'white'}):
sc.pl.umap(adata_vis_plt, color=['sample'], size=30,
color_map = 'RdPu', ncols = 1, #legend_loc='on data',
legend_fontsize=10)
Load reference cell type signature from snRNA-seq data and show UMAP of cells
接下來讹堤,我們加載預(yù)處理過的 snRNAseq 參考 anndata 對(duì)象,其中包含參考細(xì)胞類型的估計(jì)表達(dá)特征
## snRNAseq reference (raw counts)
adata_snrna_raw = sc.read(f'{reg_path}sc.h5ad')
Export reference expression signatures of cell types
# Column name containing cell type annotations
covariate_col_names = 'annotation_1'
# Extract a pd.DataFrame with signatures from anndata object
inf_aver = adata_snrna_raw.raw.var.copy()
inf_aver = inf_aver.loc[:, [f'means_cov_effect_{covariate_col_names}_{i}' for i in adata_snrna_raw.obs[covariate_col_names].unique()]]
from re import sub
inf_aver.columns = [sub(f'means_cov_effect_{covariate_col_names}_{i}', '', i) for i in adata_snrna_raw.obs[covariate_col_names].unique()]
inf_aver = inf_aver.iloc[:, inf_aver.columns.argsort()]
# normalise by average experiment scaling factor (corrects for sequencing depth)
inf_aver = inf_aver * adata_snrna_raw.uns['regression_mod']['post_sample_means']['sample_scaling'].mean()
Quick look at the cell type composition in our reference data in UMAP coordinates
with mpl.rc_context({'figure.figsize': [10, 10],
'axes.facecolor': 'white'}):
sc.pl.umap(adata_snrna_raw, color=['annotation_1'], size=15,
color_map = 'RdPu', ncols = 1, legend_loc='on data',
legend_fontsize=10)
Cell2location model description and analysis pipeline
Cell2location 被實(shí)現(xiàn)為一個(gè)可解釋的分層貝葉斯模型厨疙,從而 (1) 提供了解釋模型不確定性的原則方法洲守; (2) 考慮細(xì)胞類型豐度的線性相關(guān)性,(3) 對(duì)跨技術(shù)測(cè)量靈敏度的差異進(jìn)行建模沾凄,以及 (4) 通過采用靈活的基于計(jì)數(shù)的誤差模型來考慮無法解釋的/殘留變化梗醇。 最后,(5)cell2location 支持多個(gè)實(shí)驗(yàn)/批次的聯(lián)合建模撒蟀。
Brief description of the model
Briefly, cell2location is a Bayesian model, which estimates absolute cell density of cell types by decomposing mRNA counts (d_{s,g}) of each gene (g={1, .., G}) at locations (s={1, .., S}) into a set of predefined reference signatures of cell types (g_{fg}). Joint modelling mode works across experiments (e={1,..,E}), such as 10X Visium chips (i.e. square capture areas) and Slide-Seq V2 pucks (i.e. beads).
Cell2location models the elements of (d_{s,g}) as Negative Binomial distributed, given an unobserved rate (\mu_{s,g}) and a gene-specific over-dispersion parameter (\alpha_{eg}):
[\begin{split}D_{s,g} \sim \mathtt{NB}(\mu_{s,g}, \alpha_{eg}) \\end{split}]
The expression level of genes (\mu_{s,g}) in the mRNA count space is modelled as a linear function of expression signatures of reference cell types:
[\mu_{s,g} = \underbrace{m_{g}}{\text{technology sensitivity}} \cdot \underbrace{\left (\sum{f} {w_{s,f} : g_{f,g}} \right)}{\text{cell type contributions}} + \underbrace{l_s + s{eg}}_{\text{additive shift}}]
where, (w_{s,f}) denotes regression weight of each reference signature (f) at location (s), which can be interpreted as the number of cells at location (s) that express reference signature (f); (m_{g}) a gene-specific scaling parameter, which adjusts for global differences in sensitivity between technologies; (l_s) and (s_{eg}) are additive variables that account for gene- and location-specific shift, such as due to contaminating or free-floating RNA.
To account for the similarity of location patterns across cell types, (w_{s,f}) is modelled using another layer of decomposition (factorization) using (r={1, .., R}) groups of cell types, that can be interpreted as cellular compartments or tissue zones (Suppl. Methods). Unless stated otherwise, (R) is set to 50.
Selecting hyper-parameters
Note! While the scaling parameter (m_{g}) facilitates the integration across technologies, it leads to non-identifiability between (m_{g}) and (w_{s,f}), unless the informative priors on both variables are used. To address this, we employ informative prior distributions on (w_{s,f}) and (m_{g}), which are controlled by 4 used-provided hyper-parameters. For guidance on selecting these hyper-parameters see below and Suppl. Methods (Section 1.3).
For the mouse brain we suggest using the following values for 4 used-provided hyper-parameters: 1. (\hat{N} = 8), the expected number of cells per location, estimated based on comparison to histology image; 2. (\hat{A} = 9), the expected number of cell types per location, assuming that most cells in a given location belong to a different type and that many locations contain cell processes rather than complete cells; 3. (\hat{Y} = 5), the expected number of co-located cell type groups per location, assuming that very few cell types have linearly dependent abundance patterns, except for the regional astrocytes and corresponding neurons such that on average about 2 cell types per group are expected (\hat{A}/\hat{Y}=1.8); 4. mean and variance that define hyperprior on gene-specific scaling parameter (m_{g}), allowing the user to define prior beliefs on the sensitivity of spatial technology compared to the scRNA-seq reference.
Joing modelling of multiple experiments
Joint modelling of spatial data sets from multiple experiments provides the several benefits due to sharing information between experiments (such as 10X Visium chips (i.e. square capture areas) and Slide-Seq V2 pucks (i.e. beads)):
Increasing accuracy by improving the ability of the model to distinguish low sensitivity (m_{g}) from zero cell abundance (w_{r,f}), which is achieved by sharing the change in sensitivity between technologies (m_{g}) across experiments. Similarly to common practice in single cell data analysis, this is equivalent to regressing out the effect of technology but not the effect of individual experiment.
Increasing sensitivity by sharing information on cell types with co-varying abundances during decomposition of (w_{s,f}) into groups of cell types (r={1, .., R}).
Training cell2location: specifying data input and hyper-parameters
在這里叙谨,展示了如何訓(xùn)練 cell2location 模型來估計(jì)每個(gè)位置的細(xì)胞豐度。 此工作流包裝在單個(gè)pipeline中:
sc.settings.set_figure_params(dpi = 100, color_map = 'viridis', dpi_save = 100,
vector_friendly = True, format = 'pdf',
facecolor='white')
r = cell2location.run_cell2location(
# Single cell reference signatures as pd.DataFrame
# (could also be data as anndata object for estimating signatures
# as cluster average expression - `sc_data=adata_snrna_raw`)
sc_data=inf_aver,
# Spatial data as anndata object
sp_data=adata_vis,
# the column in sc_data.obs that gives cluster idenitity of each cell
summ_sc_data_args={'cluster_col': "annotation_1",
},
train_args={'use_raw': True, # By default uses raw slots in both of the input datasets.
'n_iter': 40000, # Increase the number of iterations if needed (see QC below)
# Whe analysing the data that contains multiple experiments,
# cell2location automatically enters the mode which pools information across experiments
'sample_name_col': 'sample'}, # Column in sp_data.obs with experiment ID (see above)
export_args={'path': results_folder, # path where to save results
'run_name_suffix': '' # optinal suffix to modify the name the run
},
model_kwargs={ # Prior on the number of cells, cell types and co-located groups
'cell_number_prior': {
# - N - the expected number of cells per location:
'cells_per_spot': 8, # < - change this
# - A - the expected number of cell types per location (use default):
'factors_per_spot': 7,
# - Y - the expected number of co-located cell type groups per location (use default):
'combs_per_spot': 7
},
# Prior beliefs on the sensitivity of spatial technology:
'gene_level_prior':{
# Prior on the mean
'mean': 1/2,
# Prior on standard deviation,
# a good choice of this value should be at least 2 times lower that the mean
'sd': 1/4
}
}
####### Cell2location model output
The results are saved to:
results_folder + r['run_name']
The absolute abundances of cell types are added to sp_data as columns of sp_data.obs. The estimates of all parameters in the model are exported to sp_data.uns['mod']
List of output files:
- sp.h5ad - Anndata object with all results and spatial data.
- W_cell_density.csv - absolute abundances of cell types, mean of the posterior distribution.
- (default) - W_cell_density_q05.csv - absolute abundances of cell types, 5% quantile of the posterior distribution representing confident cell abundance level.
- W_mRNA_count.csv - absolute mRNA abundance for each cell types, mean of the posterior distribution.
- (useful for QC, selecting mapped cell types) - W_mRNA_count_q05.csv - absolute mRNA abundance for each cell types, 5% quantile of the posterior distribution representing confident cell abundance level.
Evaluating training
需要通過檢查一些診斷圖來檢查我們的模型是否訓(xùn)練成功保屯。
首先手负,我們看一下訓(xùn)練迭代中的 ELBO 損失/成本函數(shù)。 該圖省略了前 20% 的訓(xùn)練迭代姑尺,在此期間損失發(fā)生了許多數(shù)量級(jí)的變化竟终。 在這里我們看到模型在訓(xùn)練結(jié)束時(shí)收斂,ELBO 損失函數(shù)中的一些噪聲是可以接受的股缸。 如果在最近的幾千次迭代中有很大的變化衡楞,我們建議增加 'n_iter' 參數(shù)。 (需要很多的數(shù)學(xué)知識(shí))
訓(xùn)練迭代之間 ELBO 損失的差異表明訓(xùn)練問題可能是由于細(xì)胞類型的參考不完整或不夠詳細(xì)造成的敦姻。
from IPython.display import Image
Image(filename=results_folder +r['run_name']+'/plots/training_history_without_first_20perc.png',
width=400)
We also need to evaluate the reconstruction accuracy: how well reference cell type signatures explain spatial data by comparing expected value of the model (\mu_{s,g}) (Negative Binomial mean) to observed count of each gene across locations. The ideal case is a perfect diagonal 2D histogram plot (across genes and locations).
A very fuzzy diagonal or large deviations of some genes and locations from the diagonal plot indicate that the reference signatures are incomplete. The reference could be missing certain cell types entirely (e.g. FACS-sorting one cell lineage) or clustering could be not sufficiently granular (e.g. mapping 5-10 broad cell types to a complex tissue). Below is an example of good performance:
Image(filename=results_folder +r['run_name']+'/plots/data_vs_posterior_mean.png',
width=400)
最后瘾境,需要通過比較兩次獨(dú)立訓(xùn)練重新啟動(dòng)(X 軸和 Y 軸)之間估計(jì)細(xì)胞豐度的一致性來評(píng)估識(shí)別位置的穩(wěn)健性。 下圖顯示了 2 次訓(xùn)練重啟中細(xì)胞豐度曲線之間的相關(guān)性(顏色)镰惦。 某些細(xì)胞類型可能是相關(guān)的迷守,但與對(duì)角線的過度偏差將表明解決方案的不穩(wěn)定性。
Image(filename=results_folder +r['run_name']+'/plots/evaluate_stability.png',
width=400)
第三部分旺入,單細(xì)胞空間聯(lián)合
Loading cell2location model output
首先兑凿,讓我們加載 cell2location 結(jié)果。 在 cell2location 管道的導(dǎo)出步驟中茵瘾,將跨位置的細(xì)胞類型豐度作為 sp_data.obs 的列添加到 sp_data 中礼华,并將模型的所有參數(shù)導(dǎo)出到 sp_data.uns['mod']。 這個(gè) anndata 對(duì)象和一個(gè)帶有spot位置的 csv 文件 W.csv / W_q05.csv 被保存到結(jié)果目錄中拗秘。
Normally, you would have the output on your system (e.g. by running tutorial 2/3), however, you could also start with the output from our data portal:
results_folder = './results/mouse_brain_snrna/'
r = {'run_name': 'LocationModelLinearDependentWMultiExperiment_2experiments_59clusters_5563locations_12809genes'}
# defining useful function
def select_slide(adata, s, s_col='sample'):
r""" Select data for one slide from the spatial anndata object.
:param adata: Anndata object with multiple spatial samples
:param s: name of selected sample
:param s_col: column in adata.obs listing sample name for each location
"""
slide = adata[adata.obs[s_col].isin([s]), :]
s_keys = list(slide.uns['spatial'].keys())
s_spatial = np.array(s_keys)[[s in k for k in s_keys]][0]
slide.uns['spatial'] = {s_spatial: slide.uns['spatial'][s_spatial]}
return slide
if os.path.exists(f'{results_folder}{r["run_name"]}') is not True:
os.mkdir('./results')
os.mkdir(f'{results_folder}')
os.system(f'cd {results_folder} && wget https://cell2location.cog.sanger.ac.uk/tutorial/mouse_brain_visium_results/{["run_name"]}.zip')
os.system(f'cd {results_folder} && unzip {r["run_name"]}.zip')
We load the results of the model saved into the adata_vis Anndata object:
sp_data_file = results_folder +r['run_name']+'/sp.h5ad'
adata_vis = anndata.read(sp_data_file)
Visualisation of cell type locations
首先圣絮,我們學(xué)習(xí)如何使用標(biāo)準(zhǔn)掃描繪圖工具 sc.pl.spatial 和我們的自定義工具可視化細(xì)胞類型位置,該工具使用顏色插值 cell2location.plt.mapping_video.plot_spatial 在一個(gè)圖中可視化幾種細(xì)胞類型雕旨。
Cell2location 估計(jì)參考細(xì)胞類型的絕對(duì)細(xì)胞和 mRNA 豐度扮匠。 對(duì)于這兩種測(cè)量捧请,后驗(yàn)分布的 5% 分位數(shù)用于顯示結(jié)果,代表細(xì)胞豐度和 mRNA 計(jì)數(shù)的置信水平棒搜。
For completeness, for each visium section, sc.pl.spatial was used to produce 4 figure panels showing the locations of all cell types (cell and mRNA abundance, 5% and the mean of the posterior distribution), saved to r['run_name']/plots/spatial/.
在這里疹蛉,使用絕對(duì)細(xì)胞密度(5% 分位數(shù))在一張圖中可視化多種細(xì)胞類型的位置,代表這一點(diǎn)的模型參數(shù)稱為 q05_spot_factors力麸。 顯示了映射到小鼠大腦 6 個(gè)不同區(qū)域的 6 種神經(jīng)元和神經(jīng)膠質(zhì)細(xì)胞類型可款。
from cell2location.plt.mapping_video import plot_spatial
# select up to 6 clusters
sel_clust = ['Oligo_2', 'Inh_Meis2_3', 'Inh_4', 'Ext_Thal_1', 'Ext_L23', 'Ext_L56']
sel_clust_col = ['q05_spot_factors' + str(i) for i in sel_clust]
slide = select_slide(adata_vis, 'ST8059048')
with mpl.rc_context({'figure.figsize': (15, 15)}):
fig = plot_spatial(slide.obs[sel_clust_col], labels=sel_clust,
coords=slide.obsm['spatial'] \
* list(slide.uns['spatial'].values())[0]['scalefactors']['tissue_hires_scalef'],
show_img=True, img_alpha=0.8,
style='fast', # fast or dark_background
img=list(slide.uns['spatial'].values())[0]['images']['hires'],
circle_diameter=6, colorbar_position='right')
We can produce this visualisation in dark background by setting style='dark_background' and hiding the image img_alpha=0.
with mpl.rc_context({'figure.figsize': (15, 15)}):
fig = plot_spatial(slide.obs[sel_clust_col], labels=sel_clust,
coords=slide.obsm['spatial'] \
* list(slide.uns['spatial'].values())[0]['scalefactors']['tissue_hires_scalef'],
show_img=True, img_alpha=0,
style='dark_background', # fast or dark_background
img=list(slide.uns['spatial'].values())[0]['images']['hires'],
circle_diameter=6, colorbar_position='right')
現(xiàn)在,我們將細(xì)胞豐度估計(jì)(上圖)與每種細(xì)胞類型的估計(jì) mRNA 豐度進(jìn)行比較末盔。 這對(duì)于識(shí)別哪些細(xì)胞類型沒有映射到特定組織通常很有用(mRNA 計(jì)數(shù) < 50 - 注意顏色條上的最大值)筑舅,代表這一點(diǎn)的模型參數(shù)稱為 q05_nUMI_factors座慰。
# select up to 6 clusters
sel_clust = ['Oligo_2', 'Inh_Meis2_3', 'Inh_4', 'Ext_Thal_1', 'Ext_L23', 'Ext_L56']
sel_clust_col = ['q05_nUMI_factors' + str(i) for i in sel_clust]
slide = select_slide(adata_vis, 'ST8059048')
with mpl.rc_context({'figure.figsize': (15, 15)}):
fig = plot_spatial(slide.obs[sel_clust_col], labels=sel_clust,
coords=slide.obsm['spatial'] \
* list(slide.uns['spatial'].values())[0]['scalefactors']['tissue_hires_scalef'],
show_img=True, img_alpha=0.8, max_color_quantile=0.98,
img=list(slide.uns['spatial'].values())[0]['images']['hires'],
circle_diameter=6, colorbar_position='right')
#sel_clust = ['Oligo_2', 'Inh_Meis2_3', 'Inh_4', 'Ext_Thal_1', 'Ext_L23', 'Ext_L56']
#sel_clust_col = ['q05_spot_factors' + str(i) for i in sel_clust]
# select one section correctly subsetting histology image data
slide = select_slide(adata_vis, 'ST8059048')
# plot with nice names
with mpl.rc_context({'figure.figsize': (10, 10), "font.size": 18}):
# add slide.obs with nice names
slide.obs[sel_clust] = (slide.obs[sel_clust_col])
sc.pl.spatial(slide, cmap='magma',
color=sel_clust[0:6], # limit size in this notebook
ncols=3,
size=0.8, img_key='hires',
alpha_img=0.9,
vmin=0, vmax='p98'
)
Next, we show how to use the standard scanpy pipeline to plot cell locations over histology images (for more extensive information refer to scanpy):
sel_clust = ['Oligo_2', 'Inh_Meis2_3', 'Inh_4', 'Ext_Thal_1', 'Ext_L23', 'Ext_L56']
sel_clust_col = ['q05_spot_factors' + str(i) for i in sel_clust]
# select one section correctly subsetting histology image data
slide = select_slide(adata_vis, 'ST8059048')
# plot with nice names
with mpl.rc_context({'figure.figsize': (10, 10), "font.size": 18}):
# add slide.obs with nice names
slide.obs[sel_clust] = (slide.obs[sel_clust_col])
sc.pl.spatial(slide, cmap='magma',
color=sel_clust[0:6], # limit size in this notebook
ncols=3,
size=0.8, img_key='hires',
alpha_img=0.9,
vmin=0, vmax='p99.2'
)
Identifying tissue regions by clustering
We identify tissue regions that differ in their cell composition by clustering locations using cell abundance estimated by cell2location.
通過使用每種細(xì)胞類型的估計(jì)細(xì)胞豐度對(duì) Visium 點(diǎn)進(jìn)行聚類來找到組織區(qū)域陨舱。 我們構(gòu)建了一個(gè) K-nearest neigbour (KNN) 圖,表示估計(jì)細(xì)胞豐度中位置的相似性版仔,然后應(yīng)用 Leiden 聚類游盲。 KNN 鄰居的數(shù)量應(yīng)適應(yīng)數(shù)據(jù)集的大小和解剖學(xué)定義區(qū)域的大小(即海馬區(qū)域相當(dāng)小蛮粮,因此可能被大型 n_neighbors 掩蓋)益缎。 這可以針對(duì)范圍 KNN 鄰居和 Leiden 聚類分辨率完成,直到獲得與組織解剖結(jié)構(gòu)匹配的聚類然想。
The clustering is done jointly across all Visium sections / batches, hence the region identities are directly comparable. When there are strong technical effects between multiple batches (not the case here) sc.external.pp.bbknn can be used to account for those effects during the KNN construction.
The resulting clusters are saved in adata_vis.obs['region_cluster']
.
sample_type = 'q05_nUMI_factors'
col_ind = [sample_type in i for i in adata_vis.obs.columns.tolist()]
adata_vis.obsm[sample_type] = adata_vis.obs.loc[:,col_ind].values
# compute KNN using the cell2location output
sc.pp.neighbors(adata_vis, use_rep=sample_type,
n_neighbors = 20)
# Cluster spots into regions using scanpy
sc.tl.leiden(adata_vis, resolution=1)
# add region as categorical variable
adata_vis.obs["region_cluster"] = adata_vis.obs["leiden"]
adata_vis.obs["region_cluster"] = adata_vis.obs["region_cluster"].astype("category")
Visualise the regions in UMAP based on cell abundances and in 2D
在這里莺奔,我們使用相同的 KNN 圖表示在細(xì)胞豐度方面的相似性位置,以執(zhí)行所有位置的 UMAP 投影变泄。 我們可以看到 cell2location 成功地整合了 2 個(gè)部分令哟。 您可以看到皮層中具有類似位置的區(qū)域(下方的 2D)由來自兩個(gè)樣本的點(diǎn)組成(例如區(qū)域cluster 14、16妨蛹、0 - 皮質(zhì)層 L4屏富、L5 和 L6)。
sc.tl.umap(adata_vis, min_dist = 0.3, spread = 1)
with mpl.rc_context({'figure.figsize': (8, 8)}):
sc.pl.umap(adata_vis, color=['sample', 'region_cluster'], size=30,
color_map = 'RdPu', ncols = 2, legend_loc='on data',
legend_fontsize=10)
# Plot the region identity of each location in 2D space
# Plotting UMAP of integrated datasets before 2D plots of separate sections ensures
# consistency of colour scheme via `adata_vis.uns['region_cluster_colors']`.
with mpl.rc_context({'figure.figsize': (5, 6)}):
sc.pl.spatial(select_slide(adata_vis, 'ST8059048'),
color=["region_cluster"], img_key=None
);
sc.pl.spatial(select_slide(adata_vis, 'ST8059052'),
color=["region_cluster"], img_key=None
)
######## Export regions for import to 10X Loupe Browser
我們的區(qū)域圖可以在組織學(xué)圖像上可視化蛙卤,并使用 10X 放大鏡瀏覽器進(jìn)行交互探索(請(qǐng)參閱 10X 網(wǎng)站了解說明)狠半。
# save maps for each sample separately
sam = np.array(adata_vis.obs['sample'])
for i in np.unique(sam):
s1 = adata_vis.obs[['region_cluster']]
s1 = s1.loc[sam == i]
s1.index = [x[10:] for x in s1.index]
s1.index.name = 'Barcode'
s1.to_csv(results_folder +r['run_name']+'/region_cluster29_' + i + '.csv')
Identify groups of co-located cell types using matrix factorisation(識(shí)別共定位的細(xì)胞類型)
在這里,我們使用估計(jì)的細(xì)胞豐度作為非負(fù)矩陣分解的輸入來識(shí)別共同定位的細(xì)胞類型 (R) 的組颤难,這可以解釋為細(xì)胞區(qū)室或組織區(qū)域神年。直觀地,我們假設(shè)細(xì)胞相互作用可以驅(qū)動(dòng)細(xì)胞類型豐度的線性依賴性行嗤,此外已日,我們觀察到具有高度空間交錯(cuò)的細(xì)胞比對(duì)的組織,如人類淋巴結(jié)昂验,用 NMF 比用離散cluster蛋白更好地描述捂敌。
Tip If you want to find a few most disctinct cellular compartments, use a small number of factors. If you want to find very strong co-location signal and assume that most cell types don’t co-locate, use a lot of factors (> 30 - used here). In practice, it is better to train NMF for a range of factors (R={5, .., 30}) and select (R) as a balance between capturing fine tissue zones and splitting known compartments
# number of cell type combinations - educated guess assuming that most cell types don't co-locate
n_fact = int(30)
# extract cell abundance from cell2location
X_data = adata_vis.uns['mod']['post_sample_q05']['spot_factors']
import cell2location.models as c2l
# create model class
mod_sk = c2l.CoLocatedGroupsSklearnNMF(n_fact, X_data,
n_iter = 10000,
verbose = True,
var_names=adata_vis.uns['mod']['fact_names'],
obs_names=adata_vis.obs_names,
fact_names=['fact_' + str(i) for i in range(n_fact)],
sample_id=adata_vis.obs['sample'],
init='random', random_state=0,
nmf_kwd_args={'tol':0.0001})
# train 5 times to evaluate stability
mod_sk.fit(n=5, n_type='restart')
現(xiàn)在艾扮,讓我們檢查一些診斷圖。 首先占婉,您可以看到大多數(shù)細(xì)胞類型組合在此模型的訓(xùn)練重新啟動(dòng)之間是一致的(具有高相關(guān)性的對(duì)角線)泡嘴。 使用第一次重新啟動(dòng)(y 軸),因此我們可以注意到因子 21逆济、23酌予、25(基于 0)不是很穩(wěn)健。
## Do some diagnostics
# evaluate stability by comparing trainin restarts
with mpl.rc_context({'figure.figsize': (10, 8)}):
mod_sk.evaluate_stability('cell_type_factors', align=True)
接下來奖慌,我們?cè)u(píng)估 NMF 細(xì)胞類型組在解釋單個(gè)細(xì)胞類型的豐富程度方面的準(zhǔn)確性抛虫。 您應(yīng)該會(huì)看到一個(gè)對(duì)角線 2D 直方圖,其中比較了輸入細(xì)胞密度數(shù)據(jù)(X 軸)和模型的估算值(Y 軸)简僧。 在這里建椰,我們對(duì)低豐度細(xì)胞類型進(jìn)行了一些小偏差。
# evaluate accuracy of the model
mod_sk.compute_expected()
mod_sk.plot_posterior_mu_vs_data()
Finally, let’s investigate the composition of each NMF cell type group. We use our model to compute the relative contribution of NMF groups to each cell type ('cell_type_fractions' e.g. 45% of cell abundance of Astro_THAL_hab can be explained by fact_10). Note: factors are exchangeable so while you find consistent factors, each model training restart will output those factors in a different order.
Here we export these parameters from the model into adata_vis.uns['mod_sklearn']
in the spatial anndata object, and print the cell types most specific to each NMF group:
# extract parameters into DataFrames
mod_sk.sample2df(node_name='nUMI_factors', ct_node_name = 'cell_type_factors')
# export results to scanpy object
adata_vis = mod_sk.annotate_adata(adata_vis) # as columns to .obs
adata_vis = mod_sk.export2adata(adata_vis, slot_name='mod_sklearn') # as a slot in .uns
# print the fraction of cells of each type located to each combination
mod_sk.print_gene_loadings(loadings_attr='cell_type_fractions',
gene_fact_name='cell_type_fractions')
# make nice names
from re import sub
mod_sk.cell_type_fractions.columns = [sub('mean_cell_type_factors', '', i)
for i in mod_sk.cell_type_fractions.columns]
# plot co-occuring cell type combinations
mod_sk.plot_gene_loadings(mod_sk.var_names_read, mod_sk.var_names_read,
fact_filt=mod_sk.fact_filt,
loadings_attr='cell_type_fractions',
gene_fact_name='cell_type_fractions',
cmap='RdPu', figsize=[10, 15])
Finally, we need to examine the abundance of each cell type group across locations:
# plot cell density in each combination
with mpl.rc_context({'figure.figsize': (5, 6), 'axes.facecolor': 'black'}):
# select one section correctly subsetting histology image data
slide = select_slide(adata_vis, 'ST8059048')
sc.pl.spatial(slide,
cmap='magma',
color=mod_sk.location_factors_df.columns,
ncols=6,
size=1, img_key='hires',
alpha_img=0,
vmin=0, vmax='p99.2'
)
Now we save the NMF model object to work with later (rememeber, every time you train the model, factors with the same composition will have a different order):
# save co-location models object
def pickle_model(mod, path, file_suffix=''):
file = path + 'model_' + str(mod.__class__.__name__) + '_' + str(mod.n_fact) + '_' + file_suffix + ".p"
pickle.dump({'mod': mod, 'fact_names': mod.fact_names}, file = open(file, "wb"))
print(file)
pickle_model(mod_sk, results_folder +r['run_name'] + '/', file_suffix='')
To aid this analysis, we wrapped the analysis shown above into a pipeline that automates training the NMF model with varying number of factors (including export of the same plots and data as shown above).
from cell2location import run_colocation
res_dict, adata_vis = run_colocation(
adata_vis, model_name='CoLocatedGroupsSklearnNMF',
verbose=False, return_all=True,
train_args={
'n_fact': np.arange(10, 40), # IMPORTANT: range of number of factors (10-40 here)
'n_iter': 20000, # maximum number of training iterations
'sample_name_col': 'sample', # columns in adata_vis.obs that identifies sample
'mode': 'normal',
'n_type': 'restart', 'n_restarts': 5 # number of training restarts
},
model_kwargs={'init': 'random', 'random_state': 0, 'nmf_kwd_args': {'tol': 0.00001}},
posterior_args={},
export_args={'path': results_folder + 'std_model/'+r['run_name']+'/CoLocatedComb/',
'run_name_suffix': ''})
生活很好岛马,有你更好