隔離的第11天,人生如戲激率,得與失都是很平常的事咳燕,可能我們錯過了很喜歡的姑娘,可能被深愛的人所傷害乒躺,也可能面對所在的城市感到無力留下來招盲,“躺平”或許很對,但努力一定很酷嘉冒。
好了宪肖,今天我們來分享一個單細胞多樣本整合的分析方法,VIPCCA健爬,顧名思義控乾,CCA的進階版,參考文獻在Effective and scalable single-cell data alignment with non-linear canonical correlation analysis娜遵,2021年12月發(fā)表于nucleic acids research,IF 9.1分蜕衡,我們借鑒一下,看看有沒有值得學習的地方设拟。
單細胞測序在基因調(diào)控慨仿、細胞分化和細胞多樣性研究中具有革命性意義。 隨著近年來技術(shù)的顯著改進纳胧,每個實驗檢測的單細胞數(shù)量呈指數(shù)級增長镰吆,同時大規(guī)模研究產(chǎn)生的數(shù)據(jù)集也在快速增長和積累。 因此跑慕,當前單細胞研究中的一個主要計算挑戰(zhàn)是對來自多個不同樣本或跨不同平臺和數(shù)據(jù)類型的測量進行標準化万皿,以進行有效的綜合和比較分析摧找。 這種綜合分析需要開發(fā)單細胞數(shù)據(jù)對齊方法,該方法可以消除批次效應(yīng)并考慮跨數(shù)據(jù)集的技術(shù)噪聲牢硅。
最近開發(fā)了許多單細胞數(shù)據(jù)對齊方法蹬耘。它們中的大多數(shù),除了一些值得注意的例外减余,例如最近的 iNMF综苔,都針對小型和中型數(shù)據(jù)集。這些現(xiàn)有的方法可以概括為四類:(i)基于參考的方法位岔,例如 Scmap-cluster 和 scAlign如筛,它們基于注釋良好的參考數(shù)據(jù)集對齊新的查詢數(shù)據(jù)集; (ii) 基于聚類的方法抒抬,例如 Harmony妙黍、DESC,它們通過迭代優(yōu)化聚類目標函數(shù)來消除批效應(yīng)并在嵌入空間中對齊樣本瞧剖; (iii) 基于匹配的方法拭嫁,例如 MNN 和 Scanorama,它們應(yīng)用相互最近的鄰居策略來識別跨數(shù)據(jù)集的重疊單元格和 (iv) 基于投影的方法抓于,使用統(tǒng)計模型將來自不同數(shù)據(jù)集的單個細胞投影到較低的維空間做粤,包括對投影應(yīng)用典型相關(guān)分析的Seurat ,使用來自非負矩陣分解的潛在因子進行投影的LIGER捉撮, and scVI and others that use variational techniques for projection.
然而怕品,大多數(shù)現(xiàn)有的對齊方法都存在固有缺陷,無法成功應(yīng)用于大型數(shù)據(jù)集巾遭。具體而言肉康,基于參考的方法的對齊將受到參考數(shù)據(jù)大小和參考中可用的預(yù)選細胞類型注釋的限制,因此當數(shù)據(jù)大小增加時灼舍,可能會導致錯過新發(fā)現(xiàn)的機會增加吼和。像 MNN 這樣的基于匹配的方法使用往返游走策略,該策略需要為具有兩個以上樣本的數(shù)據(jù)集生成所有成對對齊骑素,這對于大樣本量來說將是耗時的炫乓。具有復(fù)雜參數(shù)模型的方法(例如 LIGER 和 scAlign)或具有復(fù)雜事后數(shù)據(jù)處理的方法(例如 Seurat )也難以擴展到大型數(shù)據(jù)集。基于 ZINB 的方法(例如 scVI)在捕獲多個數(shù)據(jù)集的復(fù)雜表達特征方面可能效率較低献丑。盡管一些現(xiàn)有的最新方法可以擴展到大型數(shù)據(jù)集末捣,但由于復(fù)雜的參數(shù)模型,它們?nèi)匀挥锌赡懿粶蚀_地對齊細胞创橄。因此箩做,迫切需要開發(fā)在計算上也有效的有效對齊方法。
除了迫切需要開發(fā)可擴展的比對方法外妥畏,當前比對方法的另一個阻礙問題是它們的性能通常僅使用單細胞 RNA 測序 (scRNA-seq) 數(shù)據(jù)進行基準測試和優(yōu)化邦邦。 因此安吁,大多數(shù)現(xiàn)有的比對方法不適合整合其他單細胞測序數(shù)據(jù)類型,例如使用測序 (scATAC-seq) 進行轉(zhuǎn)座酶可及染色質(zhì)的單細胞測定圃酵。 此外柳畔,現(xiàn)有的比對方法(如 Seurat)返回的結(jié)果只能保留真實的細胞間關(guān)系(或相似性)馍管,而不能代表基因表達水平郭赐,不適合進行差異表達分析或富集分析等下游分析。
為了應(yīng)對這些挑戰(zhàn)确沸,作者提出了一個統(tǒng)一的計算框架 VIPCCA捌锭,它基于非線性概率典型相關(guān)分析,用于有效且可擴展的單細胞數(shù)據(jù)對齊罗捎。 VIPCCA 利用來自深度神經(jīng)網(wǎng)絡(luò)的尖端技術(shù)對單細胞數(shù)據(jù)進行非線性建模观谦,從而允許用戶通過跨技術(shù)、數(shù)據(jù)類型桨菜、條件和模式的多個單細胞數(shù)據(jù)集的集成來捕獲復(fù)雜的生物結(jié)構(gòu)豁状。此外,VIPCCA 依靠變分推理來進行可擴展計算倒得,從而能夠?qū)⒋笠?guī)模單細胞數(shù)據(jù)集與數(shù)百萬個細胞有效集成泻红。重要的是,VIPCCA 可以將多模態(tài)轉(zhuǎn)換為低維空間霞掺,而無需任何事后數(shù)據(jù)處理谊路,這是與現(xiàn)有對齊方法形成直接對比的獨特且理想的功能。
示例代碼菩彬,這里要結(jié)合上一篇分享的內(nèi)容10X單細胞空間轉(zhuǎn)錄組聯(lián)合分析之DAVAE
Integrating multiple scRNA-seq data
加載
import scbean.model.vipcca as vip
import scbean.tools.utils as tl
import scbean.tools.plotting as pl
import matplotlib
import scanpy as sc
matplotlib.use('TkAgg')
# Command for Jupyter notebooks only
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
from matplotlib.axes._axes import _log as matplotlib_axes_logger
matplotlib_axes_logger.setLevel('ERROR')
Loading data
base_path = "/Users/zhongyuanke/data/vipcca/mixed_cell_lines/"
file1 = base_path+"293t/hg19/"
file2 = base_path+"jurkat/hg19/"
file3 = base_path+"mixed/hg19/"
adata_b1 = tl.read_sc_data(file1, fmt='10x_mtx', batch_name="293t")
adata_b2 = tl.read_sc_data(file2, fmt='10x_mtx', batch_name="jurkat")
adata_b3 = tl.read_sc_data(file3, fmt='10x_mtx', batch_name="mixed")
或者
base_path = "/Users/zhongyuanke/data/vipcca/mixed_cell_lines/"
adata_b1 = tl.read_sc_data(base_path+"293t.h5ad", batch_name="293t")
adata_b2 = tl.read_sc_data(base_path+"jurkat.h5ad", batch_name="jurkat")
adata_b3 = tl.read_sc_data(base_path+"mixed.h5ad", batch_name="mixed")
Data preprocessing
adata_all = tl.preprocessing([adata_b1, adata_b2, adata_b3], index_unique="-")
VIPCCA Integration
# Command for Jupyter notebooks only
import os
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "2"
# construct a vipcca object
handle = vip.VIPCCA(
adata_all=adata_all,
res_path='/Users/zhongyuanke/data/vipcca/mixed_cell_lines/tutorials/',
mode='CVAE',
split_by="_batch",
epochs=20,
lambda_regulizer=5,
batch_input_size=128,
batch_input_size2=16
)
# do integration and return an AnnData object
adata_integrate = handle.fit_integrate()
1.The meta.data of each cell has been saved in adata.obs
2.The embedding representation from vipcca of each cell have been saved in adata.obsm(‘X_vipcca’)
Loading result
- Loading result from saved model.h5 file through vipcca: The model.h5 file of the trained result can be downloaded here
model_path = 'model.h5'
handle = vip.VIPCCA(
adata_all=adata_all,
res_path='/Users/zhongyuanke/data/vipcca/mixed_cell_lines/tutorials/',
mode='CVAE',
split_by="_batch",
epochs=20,
lambda_regulizer=5,
batch_input_size=128,
batch_input_size2=16,
model_file=model_path
)
adata_integrate = handle.fit_integrate()
- Loading result from h5ad file: The output.h5ad file of the trained result can be downloaded here
integrate_path = '/Users/zhongyuanke/data/vipcca/mixed_cell_lines/tutorials/output.h5ad'
adata_integrate = sc.read_h5ad(integrate_path)
UMAP Visualization
sc.pp.neighbors(adata_integrate, use_rep='X_vipcca')
sc.tl.umap(adata_integrate)
sc.set_figure_params(figsize=[5.5,4.5])
sc.pl.umap(adata_integrate, color=['_batch','celltype'], use_raw=False, show=True,)
Plot correlation
該函數(shù)僅適用于 fit_integrate() 函數(shù)訓練生成的 AnnData缠劝。 在基因表達矩陣中隨機選擇 2000 個位置。 x軸代表這些位置原始數(shù)據(jù)的表達值骗灶,y軸代表同一位置的vipcca整合后數(shù)據(jù)的表達值惨恭。
pl.plotCorrelation(adata_integrate.raw.X, adata_integrate.X, save=False, rnum=2000, lim=10)
scRNA-seq+scATAC-seq integration
Preprocessing with R
- For the scRNA-seq data: We obtained 13 cell types using a standard workflow in Seurat. The 13 cell types include 460 B cell progenitor, 2,992 CD14+ Monocytes, 328 CD16+ Monocytes, 1,596 CD4 Memory, 1,047 CD4 Na?ve, 383 CD8 effector, 337 CD8 Na?ve, 74 Dendritic cell, 592 Double negative T cell, 544 NK cell, 68 pDC, 52 Plateletes, and 599 pre-B cell.
- For the scATAC-seq data: First, we load in the provided peak matrix and collapse the peak matrix to a “gene activity matrix” using Seurat. Then we filtered out cells that have with fewer than 5,000 total peak counts to focus on a final set of 7,866 cells for analysis. See the Seurat website for tutorial and documentation for analysing scATAC-seq data.
library(Seurat)
library(ggplot2)
library(patchwork)
peaks <- Read10X_h5("/Users/zhongyuanke/data/seurat_data/sc_atac/atac_v1_pbmc_10k_filtered_peak_bc_matrix.h5")
# create a gene activity matrix
activity.matrix <- CreateGeneActivityMatrix(
peak.matrix = peaks,
annotation.file = "/Users/zhongyuanke/data/seurat_data/sc_atac/Homo_sapiens.GRCh37.82.gtf",
seq.levels = c(1:22, "X", "Y"),
upstream = 2000,
verbose = TRUE
)
pbmc.atac <- CreateSeuratObject(counts = peaks, assay = "ATAC", project = "10x_ATAC")
pbmc.atac[["ACTIVITY"]] <- CreateAssayObject(counts = activity.matrix)
meta <- read.table(
"/Users/zhongyuanke/data/seurat_data/sc_atac/atac_v1_pbmc_10k_singlecell.csv",
sep = ",",
header = TRUE,
row.names = 1,
stringsAsFactors = FALSE
)
meta <- meta[colnames(pbmc.atac), ]
pbmc.atac <- AddMetaData(pbmc.atac, metadata = meta)
# filter
pbmc.atac <- subset(pbmc.atac, subset = nCount_ATAC > 5000)
pbmc.atac$tech <- "atac"
After filter, we converting Seurat Object to AnnData via h5Seurat using R packages. In this case, the atac.h5ad file will be generated in the corresponding path.
library(SeuratDisk)
SaveH5Seurat(atac, filename = "atac.h5Seurat")
Convert("atac.h5Seurat", dest = "h5ad")
Importing scbean package
import scbean.model.vipcca as vip
import scbean.tools.utils as tl
import scbean.tools.plotting as pl
import scbean.tools.transferLabel as tfl
import scanpy as sc
from sklearn.preprocessing import LabelEncoder
import matplotlib
matplotlib.use('TkAgg')
# Command for Jupyter notebooks only
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
from matplotlib.axes._axes import _log as matplotlib_axes_logger
matplotlib_axes_logger.setLevel('ERROR')
adata_rna = tl.read_sc_data("/Users/zhongyuanke/data/vipcca/atac/rna.h5ad")
adata_atac = tl.read_sc_data("/Users/zhongyuanke/data/vipcca/atac/geneact.h5ad")
Data preprocessing
adata_all= tl.preprocessing([adata_rna, adata_atac])
VIPCCA Integration
# Command for Jupyter notebooks only
import os
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "2"
handle = vip.VIPCCA(adata_all=adata_all,
res_path='/Users/zhongyuanke/data/vipcca/atac_result/',
mode='CVAE',
split_by="_batch",
epochs=20,
lambda_regulizer=2,
batch_input_size=64,
batch_input_size2=14,
)
adata_integrate=handle.fit_integrate()
Cell type prediction
atac=tfl.findNeighbors(adata_integrate)
adata_atac.obs['celltype'] = atac['celltype']
adata = adata_rna.concatenate(adata_atac)
adata_integrate.obs['celltype'] = adata.obs['celltype']
UMAP Visualization
sc.pp.neighbors(adata_integrate, use_rep='X_vipcca')
sc.tl.umap(adata_integrate)
sc.set_figure_params(figsize=[5.5, 4.5])
sc.pl.umap(adata_integrate, color=['_batch', 'celltype'])
生活很好,有你更好