開工第一彈弊琴,我們來看看最新的10X單細(xì)胞聯(lián)合ATAC的分析方法,文章在scJoint integrates atlas-scale single-cell RNA-seq and ATAC-seq data with transfer learning俩檬,2022年1月發(fā)表于nature biotechnology,IF54分题画,相當(dāng)高了~~~~我們來看一下祭隔,其實(shí)這里要解決的就是多組學(xué)的聯(lián)合分析問題篮灼,下面列舉了一些我之前分享的方法忘古,供大家借鑒和參考。
10X單細(xì)胞轉(zhuǎn)錄組整合诅诱、轉(zhuǎn)錄組 && ATAC整合分析之VIPCCA
10X單細(xì)胞(10X空間轉(zhuǎn)錄組)多組學(xué)(RNA + ATAC)推斷Velovyto(MultiVelo
10X單細(xì)胞 & 10XATAC 聯(lián)合分析表征細(xì)胞調(diào)控網(wǎng)絡(luò)(MIRA)
10X單細(xì)胞 & 10XATAC聯(lián)合之調(diào)控網(wǎng)絡(luò)分析(IReNA)
10X單細(xì)胞(10X空間轉(zhuǎn)錄組)數(shù)據(jù)分析遷移之scGCN
10X單細(xì)胞核轉(zhuǎn)錄組 + 10X單細(xì)胞ATAC的聯(lián)合分析(WNN)
10X單細(xì)胞轉(zhuǎn)錄組與10X單細(xì)胞ATAC聯(lián)合分析之Seurat
當(dāng)然了髓堪,10X單細(xì)胞 && 10XATAC有兩種數(shù)據(jù)類型,一種是多組學(xué),同時(shí)測一個(gè)細(xì)胞內(nèi)的轉(zhuǎn)錄組和ATAC干旁,另外一種是把樣本分成了2份驶沼,一份測轉(zhuǎn)錄組,一份測ATAC争群,大家要根據(jù)自己的情況來甄別一下商乎。
首先我們來分分類,看看這些10X單細(xì)胞 && 10XATAC的聯(lián)合方法的區(qū)別祭阀,以及他們的優(yōu)缺點(diǎn)
- manifold alignment(Manifold alignment methods have demonstrated promising results in integrating several modalities from the same tissue. However, requiring distributions to match globally is often too restrictive when different modalities are derived from different tissues and cell types)
- matrix factorization (Liger, coupled-NMF)(matrix factorization and correlation-based methods designed for unpaired data require a separate feature selection step before integration for dimension reduction, and the method’s performance is sensitive to which genes are selected)
- using correlations to identify nearby cells across modalities(Conos, Seurat)
- neural-network approaches(Most existing neural-network methods for multiomics integration are based on autoencoders, which, with a few exceptions,
require paired data. In general, unsupervised training of several autoencoders simultaneously can be very challenging without pairing information across different modalities, with finding a common embedding manifold becomes more difficult as the complexity of the data increases)
因此,當(dāng)前的方法在處理表征多組學(xué)圖譜數(shù)據(jù)的復(fù)雜性和規(guī)模方面的能力有限鲜戒。
scJoint的改進(jìn)之處
- 一個(gè)新的損失函數(shù)专控,明確地將降維作為轉(zhuǎn)移學(xué)習(xí)中特征工程過程的一部分,允許在整個(gè)訓(xùn)練過程中修改低維特征遏餐,并且無需選擇高度可變的基因伦腐。
- 當(dāng)細(xì)胞類型不完全重疊時(shí),增加了模態(tài)對(duì)齊的靈活性的相似性損失
- weight sharing across encoders for different modalities resulting in stable training
方法核心
scJoint 的核心是對(duì)標(biāo)記(scRNA-seq)和未標(biāo)記(scATAC-seq)數(shù)據(jù)進(jìn)行協(xié)同訓(xùn)練的半監(jiān)督方法失都,解決了通過共同的低維空間對(duì)齊這兩種不同數(shù)據(jù)模式的主要挑戰(zhàn)柏蘑。 scJoint 包括三個(gè)主要步驟。步驟 1 分別通過新的基于神經(jīng)網(wǎng)絡(luò)的降維 (NNDR) 損失和余弦相似度損失在公共嵌入空間中執(zhí)行聯(lián)合降維和模態(tài)對(duì)齊粹庞。 NNDR 損失在類似于 PCA 的vein中提取具有最大可變性的正交特征咳焚,而余弦相似性損失鼓勵(lì)神經(jīng)網(wǎng)絡(luò)找到嵌入空間的投影,以便可以對(duì)齊兩種模式的大部分庞溜。 scRNA-seq 的嵌入進(jìn)一步由細(xì)胞類型分類損失引導(dǎo)革半,形成半監(jiān)督部分。在步驟 2 中流码,將 scATAC-seq 數(shù)據(jù)中的每個(gè)細(xì)胞視為query又官,通過測量它們?cè)诠睬度肟臻g中的距離來識(shí)別 scRNA-seq 細(xì)胞之間的 k 最近鄰(KNN),并從 scRNA 中轉(zhuǎn)移細(xì)胞類型標(biāo)簽-seq 通過“多數(shù)票”通過 scATAC-seq漫试。在第 3 步中六敬,通過在度量學(xué)習(xí)損失中使用轉(zhuǎn)移標(biāo)簽來進(jìn)一步改進(jìn)兩種模式之間的混合。使用標(biāo)準(zhǔn)工具(包括 tSNE 和 UMAP)從最終嵌入層獲得數(shù)據(jù)集的聯(lián)合可視化驾荣。 scJoint 需要簡單的數(shù)據(jù)預(yù)處理外构,輸入維度等于給定數(shù)據(jù)集中經(jīng)過適當(dāng)過濾后的基因數(shù)。 scATAC-seq 數(shù)據(jù)中的染色質(zhì)可及性首先轉(zhuǎn)換為基因活性評(píng)分秘车,從而允許使用單個(gè)編碼器典勇,RNA 和 ATAC 的權(quán)重共享。
方法比較叮趴,我們簡單看一下
scJoint shows accurate and robust performance on atlas data
Refining scATAC-seq annotations in heterogeneous atlas data
Integration of multimodal data across biological conditions
scJoint shows versatile performance on paired data
當(dāng)然了割笙,文章寫的方法,自然是作者的方法效果最好
核心算法,這個(gè)需要大家自己來解讀了
示例代碼
安裝
git clone https://github.com/SydneyBioX/scJoint.git
scJoint是python版本伤溉,如果分析結(jié)果保存成了rds(R版本)般码,數(shù)據(jù)需要轉(zhuǎn)換一下
library(SingleCellExperiment)
library(DropletUtils)
library(scater)
library(ggplot2)
sce_10xPBMC_atac <- readRDS("data_10x/sce_10xPBMC_atac.rds")
sce_10xPBMC_rna <- readRDS("data_10x/sce_10xPBMC_rna.rds")
# Only keep common genes between two dataset
common_genes <- intersect(rownames(sce_10xPBMC_atac),
rownames(sce_10xPBMC_rna))
length(common_genes)
# Extract the logcounts data from sce object
exprs_atac <- logcounts(sce_10xPBMC_atac[common_genes, ])
exprs_rna <- logcounts(sce_10xPBMC_rna[common_genes, ])
source("data_to_h5.R")
write_h5_scJoint(exprs_list = list(rna = exprs_rna,
atac = exprs_atac),
h5file_list = c("data_10x/exprs_10xPBMC_rna.h5",
"data_10x/exprs_10xPBMC_atac.h5"))
write_csv_scJoint(cellType_list = list(rna = sce_10xPBMC_rna$cellTypes),
csv_list = c("data_10x/cellType_10xPBMC_rna.csv"))
data_to_h5.R這個(gè)軟件有提供,大家可以下載乱顾。
Analysis of PBMC data from 10x Genomics using scJoint
import process_db
import h5py
import pandas as pd
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
import seaborn as sns
import random
random.seed(1)
rna_h5_files = ["data_10x/exprs_10xPBMC_rna.h5"]
rna_label_files = ["data_10x/cellType_10xPBMC_rna.csv"] # csv file
atac_h5_files = ["data_10x/exprs_10xPBMC_atac.h5"]
atac_label_files = []
process_db.data_parsing(rna_h5_files, atac_h5_files)
rna_label = pd.read_csv(rna_label_files[0], index_col = 0)
rna_label
print(rna_label.value_counts(sort = False))
process_db.label_parsing(rna_label_files, atac_label_files)
Running scJoint
The scRNA-seq and scATAC-seq have 15463 genes in common. And we have 11 cell types in total in the scRNA_seq adta, so we will set the number of class as 11 in the config.py file. The paths also needed to be edited accordingly. Here are the settings for integration of scRNA-seq and scATAC-seq from 10x Genomics we used:
self.number_of_class = 11 # Number of cell types in scRNA-seq data
self.input_size = 15463 # Number of common genes and proteins between scRNA-seq data and scATAC-seq
self.rna_paths = ['data_10x/exprs_10xPBMC_rna.npz'] # RNA gene expression from scRNA-seq data
self.rna_labels = ['data_10x/cellType_10xPBMC_rna.txt'] # scRNA-seq data cell type labels (coverted to numeric)
self.atac_paths = ['data_10x/exprs_10xPBMC_atac.npz'] # ATAC gene activity matrix from scATAC-seq data
self.atac_labels = []
self.rna_protein_paths = []
self.atac_protein_paths = []
Training config
self.batch_size = 256
self.lr_stage1 = 0.01
self.lr_stage3 = 0.01
self.lr_decay_epoch = 20
self.epochs_stage1 = 20
self.epochs_stage3 = 20
self.p = 0.8
self.embedding_size = 64
self.momentum = 0.9
self.center_weight = 1
self.with_crossentorpy = True
self.seed = 1
self.checkpoint = ''
After modifying config.py, we are ready to run scJoint in terminal with
python3 main.py
This takes ~4 minutes using 1 thread for PyTorch.
Visualisation
rna_embeddings = np.loadtxt('./output/exprs_10xPBMC_rna_embeddings.txt')
atac_embeddings = np.loadtxt('./output/exprs_10xPBMC_atac_embeddings.txt')
print(rna_embeddings.shape)
print(atac_embeddings.shape)
embeddings = np.concatenate((rna_embeddings, atac_embeddings))
print(embeddings.shape)
tsne_results = TSNE(perplexity=30, n_iter = 1000).fit_transform(embeddings)
tsne_results.shape
df = pd.DataFrame()
df['tSNE1'] = tsne_results[:,0]
df['tSNE2'] = tsne_results[:,1]
rna_labels = np.loadtxt('./data_10x/cellType_10xPBMC_rna.txt')
atac_predictions = np.loadtxt('./output/exprs_10xPBMC_atac_knn_predictions.txt')
labels = np.concatenate((rna_labels, atac_predictions))
label_to_idx = pd.read_csv('./data_10x/label_to_idx.txt', sep = '\t', header = None)
label_to_idx.shape
label_dic = []
for i in range(label_to_idx.shape[0]):
label_dic = np.append(label_dic, label_to_idx[0][i][:-2])
data_label = np.array(["scRNA-seq", "scATAC-seq"])
df['data'] = np.repeat(data_label, [rna_embeddings.shape[0], atac_embeddings.shape[0]], axis=0)
df['predicted'] = label_dic[labels.astype(int)]
plt.figure(figsize=(10,10))
sns.scatterplot(
x = "tSNE1", y = "tSNE2",
hue = "data",
palette = sns.color_palette("tab10", 2),
data = df,
legend = "full",
alpha = 0.3
)
plt.figure(figsize=(10,10))
sns.scatterplot(
x = "tSNE1", y = "tSNE2",
hue = "predicted",
data = df,
legend = "full",
alpha = 0.3
)
Mouse Primary Motor Cortex Data Integration using scJoint
import process_db
import h5py
import pandas as pd
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
import seaborn as sns
import random
import os
import re
random.seed(1)
output_dir = "data_MOp/output/"
embeddings_file_names = [fn for fn in os.listdir(output_dir)
if any(fn.endswith(ext) for ext in ['_embeddings.txt'])]
embeddings_file_names.sort()
embeddings_file_names
embeddings = np.loadtxt(output_dir + embeddings_file_names[0])
batch = np.repeat(re.sub('_embeddings.txt', '', embeddings_file_names[0]), embeddings.shape[0])
for fn in embeddings_file_names[1:]:
e = np.loadtxt(output_dir + fn)
embeddings = np.append(embeddings, e, axis=0)
batch = np.append(batch, np.repeat(re.sub('_embeddings.txt', '', fn), e.shape[0]))
print(embeddings.shape)
print(batch.shape)
tsne_results = TSNE(perplexity=30, n_iter = 1000).fit_transform(embeddings)
df = pd.DataFrame()
df['tSNE1'] = tsne_results[:,0]
df['tSNE2'] = tsne_results[:,1]
df['data'] = batch
prediction_file_names = [fn for fn in os.listdir(output_dir)
if any(fn.endswith(ext) for ext in ['_knn_predictions.txt'])]
#print(prediction_file_names)
atac_prediction = np.loadtxt(output_dir + prediction_file_names[0])
methy_prediction = np.loadtxt(output_dir + prediction_file_names[1])
#print(atac_prediction.shape)
#print(methy_prediction.shape)
# Reading the original labels
data_dir = "data_MOp/"
label_file_names = [fn for fn in os.listdir(data_dir)
if any(fn.endswith(ext) for ext in ['cellTypes.txt'])]
label_file_names.sort()
#print(label_file_names)
atac_original = np.loadtxt(data_dir + label_file_names[0])
methy_original = np.loadtxt(data_dir + label_file_names[8])
rna_original = []
for fn in label_file_names[1:8]:
p = np.loadtxt(data_dir + fn)
rna_original = np.append(rna_original, p)
#print(rna_original.shape)
prediction = np.append(atac_prediction, rna_original)
prediction = np.append(prediction, methy_original)
# matching the numeric labels to cell type annotation
label_to_idx = pd.read_csv(data_dir + 'label_to_idx.txt', sep = '\t', header = None)
label_to_idx.shape
label_dic = []
for i in range(label_to_idx.shape[0]):
label_dic = np.append(label_dic, label_to_idx[0][i][:-2])
df['predicted'] = label_dic[prediction.astype(int)]
plt.figure(figsize=(10,10))
sns.scatterplot(
x = "tSNE1", y = "tSNE2",
hue = "data",
palette = sns.color_palette("tab10", 9),
data = df.sample(frac=1),
legend = "full",
alpha = 0.3
)
plt.figure(figsize=(10,10))
sns.scatterplot(
x = "tSNE1", y = "tSNE2",
hue = "predicted",
data = df,
legend = "full",
alpha = 0.3
)
生活很好板祝,有你更好