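Day 10 of quarantine. The loneliness is still wearing on me. Cherish the journey as you go: everyone makes choices, and once you have chosen, cherish the people in front of you. Now, today's topic is joint analysis across 10X single-cell RNA, 10X ATAC, and 10X spatial transcriptomics data. The reference paper is "A versatile and scalable single-cell data integration algorithm based on domain-adversarial and variational approximation", published in Briefings in Bioinformatics in September 2021 (impact factor around 11, which is already high for a pure methods paper). Let's look at the principle first and then walk through the example code.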
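Over the past decade, single-cell sequencing has become a highly sensitive technology that can quantitatively measure gene expression levels, DNA methylation landscapes, chromatin accessibility, and in situ expression at single-cell resolution. A large number of single-cell datasets have been generated across different technologies, organisms, and modalities, and several large-scale comprehensive single-cell atlases are under construction, covering almost every aspect of biology and complex disease. The challenge, therefore, is to develop scalable and effective methods for integrating large single-cell datasets across samples, technologies, and modalities, and to gain biological insight into cellular heterogeneity, biological states/cell types, cell development, and spatial patterns in complex tissues.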
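The main problem in single-cell data integration is removing the various sources of noise, such as batch effects, that hinder comparison between two or more heterogeneous tissues. Over the past decade many algorithms have been proposed to address this, each focusing on different data types and with its own strengths. Reference-based integration algorithms include scmap and scAlign, which transfer annotations from a reference scRNA-seq atlas to query scRNA-seq data, but these methods cannot predict new cell types. Some methods designed for bulk RNA-seq can also be used for scRNA-seq integration, but their models strongly assume that the cell composition of every batch is identical; they include ComBat, RUVseq, and limma. In addition, several factor-analysis-based algorithms have been proposed, such as scMerge, LIGER, SPOTLight, and Duren's method. However, their high computational cost makes them hard to apply to large-scale datasets. Deep learning approaches including DCA, scVI, scGen, and DESC have been proposed to integrate scRNA-seq data based on autoencoders or variational autoencoders, and batch-free cell representations can be obtained from the bottleneck layer. However, because their underlying models are designed specifically for scRNA-seq data, these methods may be less effective at aligning single-cell data across modalities. For example, scVI uses a hierarchical Bayesian model to fit count expression data to a zero-inflated negative binomial distribution.
Another effective strategy is based on mutual nearest neighbors (MNN), first used in mnnCorrect to detect pairs of similar cells across scRNA-seq batches. mnnCorrect obtains batch-correction vectors by averaging over many MNN pairs, but because it integrates datasets sequentially, the order of the input datasets can lead to suboptimal solutions. Inspired by MNN, two similar algorithms were proposed: Seurat 3.0 and Scanorama. Seurat 3.0 uses k-MNN for each cell in its paired datasets to identify matched pairs, called "anchors", based on cell embeddings reduced by canonical correlation analysis (CCA). Although Seurat can align single-cell data across modalities, it relies on a different strategy than CCA to capture the biological structure of scATAC-seq data. Scanorama uses a generalized mutual-nearest-neighbor matching on SVD-based embeddings to find similar cells across all scRNA-seq datasets rather than in pairs of datasets.
There are also other integration models, such as graph-based models (e.g. BBKNN), clustering-based models (e.g. Harmony, DC3), geometry-based models, and multimodal cross models (e.g. MIA). Among the existing methods above, Seurat 3.0, LIGER, DC3, and Stanley's method can integrate single-cell data across modalities; Duren's method integrates scRNA-seq and scATAC-seq data; SPOTLight and MIA are designed for integrating scRNA-seq and spatial transcriptomics data; all the others apply only to scRNA-seq data.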
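To make the mutual-nearest-neighbor idea concrete, here is a minimal NumPy sketch of how MNN pairs between two batches could be found. It is only an illustration under simple assumptions (Euclidean distance, brute-force search, a hypothetical function name `mutual_nearest_neighbors`); it is not the implementation used by mnnCorrect, Seurat, or Scanorama.

```python
import numpy as np

def mutual_nearest_neighbors(a, b, k=3):
    """Return (i, j) pairs where a[i] and b[j] are among each other's k nearest neighbors."""
    # Pairwise Euclidean distances, shape (n_cells_a, n_cells_b)
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    # For every cell in a: indices of its k nearest cells in b
    nn_ab = np.argsort(d, axis=1)[:, :k]
    # For every cell in b: indices of its k nearest cells in a
    nn_ba = np.argsort(d, axis=0)[:k, :].T
    pairs = []
    for i in range(a.shape[0]):
        for j in nn_ab[i]:
            if i in nn_ba[j]:
                pairs.append((i, int(j)))
    return pairs

# Toy data: batch_b is a slightly shifted copy of the first 10 cells of batch_a
rng = np.random.default_rng(0)
batch_a = rng.normal(size=(20, 5))
batch_b = batch_a[:10] + 0.1 + 0.01 * rng.normal(size=(10, 5))
pairs = mutual_nearest_neighbors(batch_a, batch_b, k=1)
print(pairs)
```

A batch-correction method would then use the differences between the paired cells to estimate how one batch is shifted relative to the other.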
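Although the methods above offer many ways to integrate multiple single-cell datasets with different strategies, only a few support integration across samples, technologies, and modalities; few of them have demonstrated the ability to integrate paired multimodal data, and most do not scale to large datasets. To address these limitations, the paper proposes a general and scalable method that supports the following integration tasks: (i) integrating multiple scRNA-seq datasets into an atlas reference; (ii) transferring labels from well-characterized scRNA-seq data to scATAC-seq data and spatially resolved transcriptomes; (iii) integrating multimodal single-cell data; and (iv) integrating large-scale datasets.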
Overview of DAVAE
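Here we consider the problem of integrating multiple scRNA-seq datasets and multiple single-cell datasets across modalities. To address it, the authors propose a general framework, the domain-adversarial and variational autoencoder (DAVAE), which fits normalized gene expression (or chromatin accessibility) to a nonlinear model that maps a latent variable z to the expression space through a nonlinear function, with a KL regularizer and a domain-adversarial regularizer. As illustrated in the paper, DAVAE relies on a deep multilayer perceptron architecture for regression, consisting of a variational approximation network, a generative Bayesian neural network, and a domain-adversarial classifier. Deep neural networks make it possible to learn the regression model efficiently from large-scale datasets. The latent factors in the shared low-dimensional space can be used for clustering, trajectory inference, cross-modal transfer learning, and many other downstream integrative analyses.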
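As a rough illustration of how the three ingredients (reconstruction, KL regularizer, domain-adversarial regularizer) could fit together from the encoder's point of view, here is a small NumPy sketch. It is a generic domain-adversarial VAE objective, not the exact loss from the paper or from scbean; the function names, the squared-error reconstruction term, and the way `domain_lambda` weights the adversarial term are all assumptions made for the example.

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    # Sample z = mu + sigma * eps with eps ~ N(0, I)
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    # KL(N(mu, sigma^2) || N(0, I)) per cell, summed over latent dimensions
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var, axis=1)

def encoder_objective(x, x_hat, mu, log_var, domain_probs, domain_labels,
                      domain_lambda=2.0):
    # Reconstruction error (squared error is an assumption for this sketch)
    recon = np.mean(np.sum((x - x_hat) ** 2, axis=1))
    kl = np.mean(kl_to_standard_normal(mu, log_var))
    # Cross-entropy of a domain (batch) classifier applied to the latent space
    rows = np.arange(len(domain_labels))
    domain_ce = -np.mean(np.log(domain_probs[rows, domain_labels]))
    # Adversarial term: the encoder benefits when the classifier cannot
    # tell the batches apart, so the classifier loss enters with a minus sign
    return recon + kl - domain_lambda * domain_ce

rng = np.random.default_rng(0)
mu = np.zeros((4, 2))
log_var = np.zeros((4, 2))
z = reparameterize(mu, log_var, rng)
x = rng.normal(size=(4, 3))
probs = np.full((4, 2), 0.5)       # a classifier that cannot separate two batches
labels = np.array([0, 0, 1, 1])
loss = encoder_objective(x, x, mu, log_var, probs, labels)
print(z.shape, loss)
```

In actual training the domain classifier is updated to minimize its own cross-entropy while the encoder works against it, which is what pushes the latent space toward batch-invariant representations.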
Example code
Integrating multiple scRNA-seq data
Importing scbean package
import scbean.model.davae as davae
import scbean.tools.utils as tl
import scanpy as sc
import matplotlib
from numpy.random import seed
seed(2021)
matplotlib.use('TkAgg')
# Command for Jupyter notebooks only
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
from matplotlib.axes._axes import _log as matplotlib_axes_logger
matplotlib_axes_logger.setLevel('ERROR')
Loading data
base_path = "/Users/zhongyuanke/data/vipcca/mixed_cell_lines/"
file1 = base_path+"293t/hg19/"
file2 = base_path+"jurkat/hg19/"
file3 = base_path+"mixed/hg19/"
adata_b1 = tl.read_sc_data(file1, fmt='10x_mtx', batch_name="293t")
adata_b2 = tl.read_sc_data(file2, fmt='10x_mtx', batch_name="jurkat")
adata_b3 = tl.read_sc_data(file3, fmt='10x_mtx', batch_name="mixed")
Or, if the data are stored as .h5ad files:
base_path = "/Users/zhongyuanke/data/vipcca/mixed_cell_lines/"
adata_b1 = tl.read_sc_data(base_path+"293t.h5ad", batch_name="293t")
adata_b2 = tl.read_sc_data(base_path+"jurkat.h5ad", batch_name="jurkat")
adata_b3 = tl.read_sc_data(base_path+"mixed.h5ad", batch_name="mixed")
Data preprocessing
Here, we filter and normalize each dataset separately and concatenate them into one AnnData object.
adata_all = tl.davae_preprocessing([adata_b1, adata_b2, adata_b3], index_unique="-")
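For intuition, "filter and normalize each dataset separately, then concatenate" could look roughly like the NumPy sketch below. This is not what `tl.davae_preprocessing` does internally (its exact steps are not shown in this post); the helper names, the `min_genes` filter, and the library-size target of 1e4 are assumptions, and the sketch also assumes all batches already share the same gene columns.

```python
import numpy as np

def normalize_log1p(counts, target_sum=1e4):
    # Scale each cell to the same total count, then log-transform
    lib = counts.sum(axis=1, keepdims=True)
    lib[lib == 0] = 1.0
    return np.log1p(counts / lib * target_sum)

def preprocess_and_concat(batches, min_genes=1):
    processed, labels = [], []
    for b, counts in enumerate(batches):
        # Drop cells that express fewer than min_genes genes
        keep = (counts > 0).sum(axis=1) >= min_genes
        processed.append(normalize_log1p(counts[keep].astype(float)))
        # Remember which batch every remaining cell came from
        labels.append(np.full(keep.sum(), b))
    return np.vstack(processed), np.concatenate(labels)

toy_batches = [np.array([[1, 0, 3], [0, 0, 0]]), np.array([[2, 2, 0]])]
x_all, batch_labels = preprocess_and_concat(toy_batches)
print(x_all.shape, batch_labels)
```

The batch label vector plays the role of the per-cell batch annotation that the integration step later uses as the domain.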
DAVAE Integration
# Reduce TensorFlow log verbosity (hide INFO and WARNING messages)
import os
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "2"
adata_integrate = davae.fit_integration(
    adata_all,
    batch_num=3,
    domain_lambda=2.0,
    epochs=25,
    sparse=True,
    hidden_layers=[64, 32, 6]
)
1. The metadata of each cell is saved in adata.obs.
2. The DAVAE embedding of each cell is saved in adata.obsm['X_davae'].
UMAP Visualization
import umap
adata_integrate.obsm['X_umap']=umap.UMAP().fit_transform(adata_integrate.obsm['X_davae'])
sc.pl.umap(adata_integrate, color=['_batch', 'celltype'], s=3)
Integrating spatial data
Importing scbean package
import scbean.model.davae as davae
from scbean.tools import utils as tl
import scanpy as sc
import matplotlib.pyplot as plt
import matplotlib
from numpy.random import seed
seed(2021)
matplotlib.use('TkAgg')
# Command for Jupyter notebooks only
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
from matplotlib.axes._axes import _log as matplotlib_axes_logger
matplotlib_axes_logger.setLevel('ERROR')
DAVAE integration of two spatial gene expression data
base_path = '/Users/zhongyuanke/data/'
file1_spatial = base_path+'spatial/mouse_brain/10x_mouse_brain_Anterior/'
file2_spatial = base_path+'spatial/mouse_brain/10x_mouse_brain_Posterior/'
file1 = base_path+'spatial/mouse_brain/10x_mouse_brain_Anterior/V1_Mouse_Brain_Sagittal_Anterior_filtered_feature_bc_matrix.h5'
file2 = base_path+'spatial/mouse_brain/10x_mouse_brain_Posterior/V1_Mouse_Brain_Sagittal_Posterior_filtered_feature_bc_matrix.h5'
adata_spatial_anterior = sc.read_visium(file1_spatial, count_file=file1)
adata_spatial_posterior = sc.read_visium(file2_spatial, count_file=file2)
adata_spatial_anterior.var_names_make_unique()
adata_spatial_posterior.var_names_make_unique()
Data preprocessing
Here, we filter and normalize each dataset separately and concatenate them into one AnnData object.
adata_spatial = tl.spatial_preprocessing([adata_spatial_anterior, adata_spatial_posterior])
DAVAE integration
adata_integrate = davae.fit_integration(
    adata_spatial,
    epochs=25,
    split_by='loss_weight',
    hidden_layers=[128, 64, 32, 5],
    sparse=True,
    domain_lambda=0.5,
)
adata_spatial.obsm["X_davae"] = adata_integrate.obsm['X_davae']
UMAP visualization and clustering
sc.set_figure_params(facecolor="white", figsize=(5, 4))
sc.pp.neighbors(adata_spatial, use_rep='X_davae', n_neighbors=12)
sc.tl.umap(adata_spatial)
sc.tl.louvain(adata_spatial, key_added="clusters")
sc.pl.umap(adata_spatial, color=['library_id', "clusters"],
           size=8, color_map='Set2', frameon=False)
Visualization in spatial coordinates
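# Map each cluster label to its color; using the actual cluster categories
# avoids hard-coding the number of clusters found by Louvain
clusters_colors = dict(
    zip(adata_spatial.obs["clusters"].cat.categories, adata_spatial.uns["clusters_colors"])
)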
fig, axs = plt.subplots(1, 2, figsize=(10, 6))
for i, library in enumerate(
    ["V1_Mouse_Brain_Sagittal_Anterior", "V1_Mouse_Brain_Sagittal_Posterior"]
):
    ad = adata_spatial[adata_spatial.obs.library_id == library, :].copy()
    sc.pl.spatial(
        ad,
        img_key="hires",
        library_id=library,
        color="clusters",
        size=1.5,
        palette=[
            v
            for k, v in clusters_colors.items()
            if k in ad.obs.clusters.unique().tolist()
        ],
        legend_loc=None,
        show=False,
        ax=axs[i],
    )
plt.tight_layout()
plt.show()
DAVAE integration of spatial gene expression and scRNA-seq data (joint single-cell and spatial analysis)
import pandas as pd
from sklearn.metrics.pairwise import cosine_distances
import numpy as np
base_path = '/Users/zhongyuanke/data/'
file_rna = base_path+'spatial/mouse_brain/adata_processed_sc.h5ad'
adata_rna = sc.read_h5ad(file_rna)
file1 = base_path+'spatial/mouse_brain/10x_mouse_brain_Anterior/V1_Mouse_Brain_Sagittal_Anterior_filtered_feature_bc_matrix.h5'
file1_spatial = base_path+'spatial/mouse_brain/10x_mouse_brain_Anterior/'
adata_spatial_anterior = sc.read_visium(file1_spatial, count_file=file1)
adata_spatial_anterior.var_names_make_unique()
adata_spatial_anterior = adata_spatial_anterior[
    adata_spatial_anterior.obsm["spatial"][:, 1] < 6000, :
]
Preprocessing
adata_all = tl.spatial_rna_preprocessing(
    adata_spatial_anterior,
    adata_rna,
)
DAVAE integration
adata_integrate = davae.fit_integration(
    adata_all,
    epochs=40,
    batch_size=128,
    domain_lambda=2.5,
    sparse=True,
    hidden_layers=[128, 64, 32, 10]
)
Calculate distance
len_anterior = adata_spatial_anterior.shape[0]
len_rna = adata_rna.shape[0]
davae_emb = adata_integrate.obsm['X_davae']
adata_spatial_anterior.obsm["davae_embedding"] = davae_emb[0:len_anterior, :]
adata_rna.obsm['davae_embedding'] = davae_emb[len_anterior:len_rna+len_anterior, :]
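# 1 - cosine distance is the cosine similarity, so larger values mean
# a closer match; shape is (n_rna_cells, n_spatial_spots)
distances_anterior = 1 - cosine_distances(
    adata_rna.obsm["davae_embedding"],
    adata_spatial_anterior.obsm['davae_embedding'],
)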
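The quantity computed here can be reproduced on a toy example with plain NumPy. The sketch below uses a hypothetical helper, `cosine_similarity_matrix`, and made-up two-dimensional embeddings; it only shows what the matrix contains and how its rows and columns are laid out.

```python
import numpy as np

def cosine_similarity_matrix(a, b):
    # Normalize every row to unit length, then take all pairwise dot products
    a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_unit @ b_unit.T

# Two scRNA-seq cells and three spatial spots in a 2-D latent space
rna_emb = np.array([[1.0, 0.0], [0.0, 2.0]])
spot_emb = np.array([[3.0, 0.0], [1.0, 1.0], [0.0, -1.0]])
sim = cosine_similarity_matrix(rna_emb, spot_emb)
print(sim.shape)   # rows are cells, columns are spots
print(sim.round(3))
```

Vector length does not matter for this measure, only direction: the first cell and the first spot point the same way, so their similarity is 1 even though their norms differ.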
Transfer label
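def label_transfer(dist, labels):
    # One-hot encode the labels: shape (n_classes, n_cells)
    lab = pd.get_dummies(labels).to_numpy().T
    # Sum the similarities of all cells in each class: (n_classes, n_spots)
    class_prob = lab @ dist
    # Normalize each spot's score vector to unit length
    norm = np.linalg.norm(class_prob, 2, axis=0)
    class_prob = class_prob / norm
    # Min-max scale each class across spots; np.ptp replaces ndarray.ptp,
    # which was removed in NumPy 2.0
    class_prob = (class_prob.T - class_prob.min(1)) / np.ptp(class_prob, axis=1)
    return class_prob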
class_prob_anterior = label_transfer(distances_anterior, adata_rna.obs.cell_subclass)
cp_anterior_df = pd.DataFrame(
    class_prob_anterior,
    columns=np.sort(adata_rna.obs.cell_subclass.unique())
)
cp_anterior_df.index = adata_spatial_anterior.obs.index
adata_anterior_transfer = adata_spatial_anterior.copy()
adata_anterior_transfer.obs = pd.concat(
    [adata_spatial_anterior.obs, cp_anterior_df],
    axis=1
)
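To see what the label-transfer step produces, here is a NumPy-only toy version run on three cells and two spots. The function name `label_transfer_np` and the similarity values are made up for illustration; the arithmetic follows the same steps as `label_transfer` (one-hot labels, per-class sums, per-spot normalization, per-class min-max scaling).

```python
import numpy as np

def label_transfer_np(sim, labels):
    classes = np.unique(labels)  # sorted class names
    # One-hot matrix of shape (n_classes, n_cells)
    onehot = (labels[None, :] == classes[:, None]).astype(float)
    scores = onehot @ sim        # (n_classes, n_spots)
    scores = scores / np.linalg.norm(scores, 2, axis=0)
    # Min-max scale each class across spots; result is (n_spots, n_classes)
    scores = (scores.T - scores.min(1)) / np.ptp(scores, axis=1)
    return classes, scores

cell_labels = np.array(["L4", "L2/3 IT", "L4"])
# Similarity of each cell (rows) to each spot (columns)
toy_sim = np.array([[0.9, 0.1], [0.2, 0.8], [0.7, 0.3]])
classes, probs = label_transfer_np(toy_sim, cell_labels)
print(classes)
print(probs)
```

The columns come out in sorted class order, which is why the DataFrame in the tutorial is built with `np.sort(...unique())` as its column names. Because of the min-max scaling, the values are relative scores per class across spots rather than calibrated probabilities.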
Visualize the neurons cortical layers
sc.set_figure_params(facecolor="white", figsize=(2, 2))
sc.pl.spatial(
    adata_anterior_transfer,
    img_key="hires",
    color=["L2/3 IT", "L4", "L5 PT", "L6 CT"],
    size=1.5,
    color_map='Blues',
)
Life is good, and it is even better with you.