10X單細(xì)胞 && 10XATAC聯(lián)合分析之scJoint

開工第一彈弊琴，我們來看看最新的10X單細(xì)胞聯(lián)合ATAC的分析方法，文章在scJoint integrates atlas-scale single-cell RNA-seq and ATAC-seq data with transfer learning俩檬，2022年1月發(fā)表于nature biotechnology,IF54分题画，相當(dāng)高了~~~~我們來看一下祭隔，其實(shí)這里要解決的就是多組學(xué)的聯(lián)合分析問題篮灼，下面列舉了一些我之前分享的方法忘古，供大家借鑒和參考。

當(dāng)然了髓堪，10X單細(xì)胞 && 10XATAC有兩種數(shù)據(jù)類型，一種是多組學(xué)，同時(shí)測一個(gè)細(xì)胞內(nèi)的轉(zhuǎn)錄組和ATAC干旁，另外一種是把樣本分成了2份驶沼，一份測轉(zhuǎn)錄組，一份測ATAC争群，大家要根據(jù)自己的情況來甄別一下商乎。

首先我們來分分類，看看這些10X單細(xì)胞 && 10XATAC的聯(lián)合方法的區(qū)別祭阀，以及他們的優(yōu)缺點(diǎn)

manifold alignment（Manifold alignment methods have demonstrated promising results in integrating several modalities from the same tissue. However, requiring distributions to match globally is often too restrictive when different modalities are derived from different tissues and cell types）
matrix factorization (Liger, coupled-NMF)（matrix factorization and correlation-based methods designed for unpaired data require a separate feature selection step before integration for dimension reduction, and the method’s performance is sensitive to which genes are selected）
using correlations to identify nearby cells across modalities(Conos, Seurat)
neural-network approaches（Most existing neural-network methods for multiomics integration are based on autoencoders, which, with a few exceptions,
require paired data. In general, unsupervised training of several autoencoders simultaneously can be very challenging without pairing information across different modalities, with finding a common embedding manifold becomes more difficult as the complexity of the data increases）

因此，當(dāng)前的方法在處理表征多組學(xué)圖譜數(shù)據(jù)的復(fù)雜性和規(guī)模方面的能力有限鲜戒。

scJoint的改進(jìn)之處

一個(gè)新的損失函數(shù)专控，明確地將降維作為轉(zhuǎn)移學(xué)習(xí)中特征工程過程的一部分，允許在整個(gè)訓(xùn)練過程中修改低維特征遏餐，并且無需選擇高度可變的基因伦腐。
當(dāng)細(xì)胞類型不完全重疊時(shí)，增加了模態(tài)對(duì)齊的靈活性的相似性損失
weight sharing across encoders for different modalities resulting in stable training

圖片.png

方法核心

scJoint 的核心是對(duì)標(biāo)記（scRNA-seq）和未標(biāo)記（scATAC-seq）數(shù)據(jù)進(jìn)行協(xié)同訓(xùn)練的半監(jiān)督方法失都，解決了通過共同的低維空間對(duì)齊這兩種不同數(shù)據(jù)模式的主要挑戰(zhàn)柏蘑。 scJoint 包括三個(gè)主要步驟。步驟 1 分別通過新的基于神經(jīng)網(wǎng)絡(luò)的降維 (NNDR) 損失和余弦相似度損失在公共嵌入空間中執(zhí)行聯(lián)合降維和模態(tài)對(duì)齊粹庞。 NNDR 損失在類似于 PCA 的vein中提取具有最大可變性的正交特征咳焚，而余弦相似性損失鼓勵(lì)神經(jīng)網(wǎng)絡(luò)找到嵌入空間的投影，以便可以對(duì)齊兩種模式的大部分庞溜。 scRNA-seq 的嵌入進(jìn)一步由細(xì)胞類型分類損失引導(dǎo)革半，形成半監(jiān)督部分。在步驟 2 中流码，將 scATAC-seq 數(shù)據(jù)中的每個(gè)細(xì)胞視為query又官，通過測量它們?cè)诠睬度肟臻g中的距離來識(shí)別 scRNA-seq 細(xì)胞之間的 k 最近鄰（KNN），并從 scRNA 中轉(zhuǎn)移細(xì)胞類型標(biāo)簽-seq 通過“多數(shù)票”通過 scATAC-seq漫试。在第 3 步中六敬，通過在度量學(xué)習(xí)損失中使用轉(zhuǎn)移標(biāo)簽來進(jìn)一步改進(jìn)兩種模式之間的混合。使用標(biāo)準(zhǔn)工具（包括 tSNE 和 UMAP）從最終嵌入層獲得數(shù)據(jù)集的聯(lián)合可視化驾荣。 scJoint 需要簡單的數(shù)據(jù)預(yù)處理外构，輸入維度等于給定數(shù)據(jù)集中經(jīng)過適當(dāng)過濾后的基因數(shù)。 scATAC-seq 數(shù)據(jù)中的染色質(zhì)可及性首先轉(zhuǎn)換為基因活性評(píng)分秘车，從而允許使用單個(gè)編碼器典勇，RNA 和 ATAC 的權(quán)重共享。

方法比較叮趴，我們簡單看一下

scJoint shows accurate and robust performance on atlas data

圖片.png

Refining scATAC-seq annotations in heterogeneous atlas data

圖片.png

Integration of multimodal data across biological conditions

圖片.png

scJoint shows versatile performance on paired data

圖片.png

當(dāng)然了割笙，文章寫的方法，自然是作者的方法效果最好

核心算法，這個(gè)需要大家自己來解讀了

圖片.png

示例代碼

安裝

git clone https://github.com/SydneyBioX/scJoint.git

scJoint是python版本伤溉，如果分析結(jié)果保存成了rds（R版本）般码，數(shù)據(jù)需要轉(zhuǎn)換一下

library(SingleCellExperiment)
library(DropletUtils)
library(scater)
library(ggplot2)
sce_10xPBMC_atac <- readRDS("data_10x/sce_10xPBMC_atac.rds")
sce_10xPBMC_rna <- readRDS("data_10x/sce_10xPBMC_rna.rds")
# Only keep common genes between two dataset
common_genes <- intersect(rownames(sce_10xPBMC_atac),
                          rownames(sce_10xPBMC_rna))
length(common_genes)

# Extract the logcounts data from sce object
exprs_atac <- logcounts(sce_10xPBMC_atac[common_genes, ])
exprs_rna <- logcounts(sce_10xPBMC_rna[common_genes, ])



source("data_to_h5.R")
write_h5_scJoint(exprs_list = list(rna = exprs_rna,
                                   atac = exprs_atac), 
                 h5file_list = c("data_10x/exprs_10xPBMC_rna.h5", 
                                 "data_10x/exprs_10xPBMC_atac.h5"))
write_csv_scJoint(cellType_list =  list(rna = sce_10xPBMC_rna$cellTypes),
                  csv_list = c("data_10x/cellType_10xPBMC_rna.csv"))

data_to_h5.R這個(gè)軟件有提供，大家可以下載乱顾。

Analysis of PBMC data from 10x Genomics using scJoint

import process_db
import h5py
import pandas as pd
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
import seaborn as sns
import random
random.seed(1)

rna_h5_files = ["data_10x/exprs_10xPBMC_rna.h5"] 
rna_label_files = ["data_10x/cellType_10xPBMC_rna.csv"] # csv file

atac_h5_files = ["data_10x/exprs_10xPBMC_atac.h5"]
atac_label_files = []

process_db.data_parsing(rna_h5_files, atac_h5_files)
rna_label = pd.read_csv(rna_label_files[0], index_col = 0)
rna_label
print(rna_label.value_counts(sort = False))
process_db.label_parsing(rna_label_files, atac_label_files)

Running scJoint

The scRNA-seq and scATAC-seq have 15463 genes in common. And we have 11 cell types in total in the scRNA_seq adta, so we will set the number of class as 11 in the config.py file. The paths also needed to be edited accordingly. Here are the settings for integration of scRNA-seq and scATAC-seq from 10x Genomics we used:

self.number_of_class = 11 # Number of cell types in scRNA-seq data
self.input_size = 15463 # Number of common genes and proteins between scRNA-seq data and scATAC-seq
self.rna_paths = ['data_10x/exprs_10xPBMC_rna.npz'] # RNA gene expression from scRNA-seq data
self.rna_labels = ['data_10x/cellType_10xPBMC_rna.txt'] # scRNA-seq data cell type labels (coverted to numeric)
self.atac_paths = ['data_10x/exprs_10xPBMC_atac.npz'] # ATAC gene activity matrix from scATAC-seq data
self.atac_labels = []
self.rna_protein_paths = []
self.atac_protein_paths = []

Training config

self.batch_size = 256
self.lr_stage1 = 0.01
self.lr_stage3 = 0.01
self.lr_decay_epoch = 20
self.epochs_stage1 = 20
self.epochs_stage3 = 20
self.p = 0.8
self.embedding_size = 64
self.momentum = 0.9
self.center_weight = 1
self.with_crossentorpy = True
self.seed = 1
self.checkpoint = ''
After modifying config.py, we are ready to run scJoint in terminal with
python3 main.py
This takes ~4 minutes using 1 thread for PyTorch.

Visualisation

rna_embeddings = np.loadtxt('./output/exprs_10xPBMC_rna_embeddings.txt')
atac_embeddings = np.loadtxt('./output/exprs_10xPBMC_atac_embeddings.txt')
print(rna_embeddings.shape)
print(atac_embeddings.shape)
embeddings =  np.concatenate((rna_embeddings, atac_embeddings))
print(embeddings.shape)
tsne_results = TSNE(perplexity=30, n_iter = 1000).fit_transform(embeddings)
tsne_results.shape
df = pd.DataFrame()
df['tSNE1'] = tsne_results[:,0]
df['tSNE2'] = tsne_results[:,1]
rna_labels = np.loadtxt('./data_10x/cellType_10xPBMC_rna.txt')
atac_predictions = np.loadtxt('./output/exprs_10xPBMC_atac_knn_predictions.txt')
labels =  np.concatenate((rna_labels, atac_predictions))
label_to_idx = pd.read_csv('./data_10x/label_to_idx.txt', sep = '\t', header = None)
label_to_idx.shape
label_dic = []
for i in range(label_to_idx.shape[0]):
    label_dic = np.append(label_dic, label_to_idx[0][i][:-2])

data_label = np.array(["scRNA-seq", "scATAC-seq"])
df['data'] = np.repeat(data_label, [rna_embeddings.shape[0], atac_embeddings.shape[0]], axis=0)
df['predicted'] = label_dic[labels.astype(int)]

plt.figure(figsize=(10,10))
sns.scatterplot(
    x = "tSNE1", y = "tSNE2",
    hue = "data",
    palette = sns.color_palette("tab10", 2),
    data = df,
    legend = "full",
    alpha = 0.3
)

plt.figure(figsize=(10,10))
sns.scatterplot(
    x = "tSNE1", y = "tSNE2",
    hue = "predicted",
    data = df,
    legend = "full",
    alpha = 0.3
)

圖片.png

Mouse Primary Motor Cortex Data Integration using scJoint

import process_db
import h5py
import pandas as pd
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
import seaborn as sns
import random
import os
import re
random.seed(1)
output_dir = "data_MOp/output/"
embeddings_file_names = [fn for fn in os.listdir(output_dir)
                         if any(fn.endswith(ext) for ext in ['_embeddings.txt'])]
embeddings_file_names.sort()
embeddings_file_names
embeddings = np.loadtxt(output_dir + embeddings_file_names[0])
batch = np.repeat(re.sub('_embeddings.txt', '', embeddings_file_names[0]), embeddings.shape[0])
for fn in embeddings_file_names[1:]:
    e = np.loadtxt(output_dir + fn)
    embeddings = np.append(embeddings, e, axis=0)
    batch = np.append(batch, np.repeat(re.sub('_embeddings.txt', '', fn), e.shape[0]))
print(embeddings.shape)
print(batch.shape)

tsne_results = TSNE(perplexity=30, n_iter = 1000).fit_transform(embeddings)

df = pd.DataFrame()
df['tSNE1'] = tsne_results[:,0]
df['tSNE2'] = tsne_results[:,1]
df['data'] = batch

prediction_file_names = [fn for fn in os.listdir(output_dir)
                         if any(fn.endswith(ext) for ext in ['_knn_predictions.txt'])]
#print(prediction_file_names)
atac_prediction = np.loadtxt(output_dir + prediction_file_names[0])
methy_prediction = np.loadtxt(output_dir + prediction_file_names[1])
#print(atac_prediction.shape)
#print(methy_prediction.shape)

# Reading the original labels
data_dir = "data_MOp/"
label_file_names = [fn for fn in os.listdir(data_dir)
                         if any(fn.endswith(ext) for ext in ['cellTypes.txt'])]
label_file_names.sort()
#print(label_file_names)

atac_original = np.loadtxt(data_dir + label_file_names[0])
methy_original = np.loadtxt(data_dir + label_file_names[8])
rna_original = []

for fn in label_file_names[1:8]:
    p = np.loadtxt(data_dir + fn)
    rna_original = np.append(rna_original, p)
    
#print(rna_original.shape)
prediction = np.append(atac_prediction, rna_original)
prediction = np.append(prediction, methy_original)

# matching the numeric labels to cell type annotation
label_to_idx = pd.read_csv(data_dir + 'label_to_idx.txt', sep = '\t', header = None)
label_to_idx.shape
label_dic = []
for i in range(label_to_idx.shape[0]):
    label_dic = np.append(label_dic, label_to_idx[0][i][:-2])
    
df['predicted'] = label_dic[prediction.astype(int)]

plt.figure(figsize=(10,10))
sns.scatterplot(
    x = "tSNE1", y = "tSNE2",
    hue = "data",
    palette = sns.color_palette("tab10", 9),
    data = df.sample(frac=1),
    legend = "full",
    alpha = 0.3
)

圖片.png

plt.figure(figsize=(10,10))
sns.scatterplot(
    x = "tSNE1", y = "tSNE2",
    hue = "predicted",
    data = df,
    legend = "full",
    alpha = 0.3
)

圖片.png

生活很好板祝，有你更好

?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請(qǐng)聯(lián)系作者

禁止轉(zhuǎn)載，如需轉(zhuǎn)載請(qǐng)通過簡信或評(píng)論聯(lián)系作者走净。

人面猴
序言：七十年代末券时，一起剝皮案震驚了整個(gè)濱河市，隨后出現(xiàn)的幾起案子伏伯，更是在濱河造成了極大的恐慌橘洞，老刑警劉巖，帶你破解...
沈念sama閱讀 206,126評(píng)論 6贊 481
死咒
序言：濱河連續(xù)發(fā)生了三起死亡事件说搅，死亡現(xiàn)場離奇詭異炸枣，居然都是意外死亡，警方通過查閱死者的電腦和手機(jī)弄唧，發(fā)現(xiàn)死者居然都...
沈念sama閱讀 88,254評(píng)論 2贊 382
救了他兩次的神仙讓他今天三更去死
文/潘曉璐我一進(jìn)店門适肠，熙熙樓的掌柜王于貴愁眉苦臉地迎上來，“玉大人候引，你說我怎么就攤上這事侯养。” “怎么了澄干？”我有些...
開封第一講書人閱讀 152,445評(píng)論 0贊 341
道士緝兇錄：失蹤的賣姜人
文/不壞的土叔我叫張陵沸毁，是天一觀的道長。經(jīng)常有香客問我傻寂，道長息尺，這世上最難降的妖魔是什么？我笑而不...
開封第一講書人閱讀 55,185評(píng)論 1贊 278
?港島之戀（遺憾婚禮）
正文為了忘掉前任疾掰，我火速辦了婚禮搂誉，結(jié)果婚禮上，老公的妹妹穿的比我還像新娘静檬。我一直安慰自己炭懊，他們只是感情好，可當(dāng)我...
茶點(diǎn)故事閱讀 64,178評(píng)論 5贊 371
惡毒庶女頂嫁案：這布局不是一般人想出來的
文/花漫我一把揭開白布拂檩。她就那樣靜靜地躺著侮腹，像睡著了一般。火紅的嫁衣襯著肌膚如雪稻励。梳的紋絲不亂的頭發(fā)上父阻，一...
開封第一講書人閱讀 48,970評(píng)論 1贊 284
城市分裂傳說
那天愈涩，我揣著相機(jī)與錄音，去河邊找鬼加矛。笑死履婉，一個(gè)胖子當(dāng)著我的面吹牛，可吹牛的內(nèi)容都是我干的斟览。我是一名探鬼主播毁腿，決...
沈念sama閱讀 38,276評(píng)論 3贊 399
雙鴛鴦連環(huán)套：你想象不到人心有多黑
文/蒼蘭香墨我猛地睜開眼，長吁一口氣：“原來是場噩夢(mèng)啊……” “哼苛茂！你這毒婦竟也來了已烤？” 一聲冷哼從身側(cè)響起，我...
開封第一講書人閱讀 36,927評(píng)論 0贊 259
萬榮殺人案實(shí)錄
序言：老撾萬榮一對(duì)情侶失蹤妓羊，失蹤者是張志新（化名）和其女友劉穎草戈，沒想到半個(gè)月后，有當(dāng)?shù)厝嗽跇淞掷锇l(fā)現(xiàn)了一具尸體侍瑟，經(jīng)...
沈念sama閱讀 43,400評(píng)論 1贊 300
?護(hù)林員之死
正文獨(dú)居荒郊野嶺守林人離奇死亡，尸身上長有42處帶血的膿包…… 初始之章·張勛以下內(nèi)容為張勛視角年9月15日...
茶點(diǎn)故事閱讀 35,883評(píng)論 2贊 323
?白月光啟示錄
正文我和宋清朗相戀三年丙猬，在試婚紗的時(shí)候發(fā)現(xiàn)自己被綠了涨颜。大學(xué)時(shí)的朋友給我發(fā)了我未婚夫和他白月光在一起吃飯的照片。...
茶點(diǎn)故事閱讀 37,997評(píng)論 1贊 333
活死人
序言：一個(gè)原本活蹦亂跳的男人離奇死亡茧球，死狀恐怖庭瑰，靈堂內(nèi)的尸體忽然破棺而出，到底是詐尸還是另有隱情抢埋，我是刑警寧澤弹灭，帶...
沈念sama閱讀 33,646評(píng)論 4贊 322
?日本核電站爆炸內(nèi)幕
正文年R本政府宣布榕暇，位于F島的核電站饱须，受9級(jí)特大地震影響奠骄，放射性物質(zhì)發(fā)生泄漏梧却。R本人自食惡果不足惜秒咐，卻給世界環(huán)境...
茶點(diǎn)故事閱讀 39,213評(píng)論 3贊 307
男人毒藥：我在死后第九天來索命
文/蒙蒙一矛缨、第九天我趴在偏房一處隱蔽的房頂上張望拐叉。院中可真熱鬧毫深，春花似錦酷愧、人聲如沸驾诈。這莊子的主人今日做“春日...
開封第一講書人閱讀 30,204評(píng)論 0贊 19
一樁弒父案溶浴，背后竟有這般陰謀
文/蒼蘭香墨我抬頭看了看天上的太陽乍迄。三九已至，卻和暖如春士败，著一層夾襖步出監(jiān)牢的瞬間闯两，已是汗流浹背。一陣腳步聲響...
開封第一講書人閱讀 31,423評(píng)論 1贊 260
情欲美人皮
我被黑心中介騙來泰國打工，沒想到剛下飛機(jī)就差點(diǎn)兒被人妖公主榨干…… 1. 我叫王不留生蚁，地道東北人噩翠。一個(gè)月前我還...
沈念sama閱讀 45,423評(píng)論 2贊 352
代替公主和親
正文我出身青樓，卻偏偏與公主長得像邦投，于是被迫代替她去往敵國和親伤锚。傳聞我的和親對(duì)象是個(gè)殘疾皇子，可洞房花燭夜當(dāng)晚...
茶點(diǎn)故事閱讀 42,722評(píng)論 2贊 345

10X單細(xì)胞 && 10XATAC聯(lián)合分析之scJoint

10X單細(xì)胞 && 10XATAC聯(lián)合分析之scJoint

10X單細(xì)胞轉(zhuǎn)錄組整合诅诱、轉(zhuǎn)錄組 && ATAC整合分析之VIPCCA

10X單細(xì)胞（10X空間轉(zhuǎn)錄組）多組學(xué)（RNA + ATAC）推斷Velovyto（MultiVelo

10X單細(xì)胞 & 10XATAC 聯(lián)合分析表征細(xì)胞調(diào)控網(wǎng)絡(luò)（MIRA)

10X單細(xì)胞 & 10XATAC聯(lián)合之調(diào)控網(wǎng)絡(luò)分析（IReNA）

10X單細(xì)胞（10X空間轉(zhuǎn)錄組）數(shù)據(jù)分析遷移之scGCN

10X單細(xì)胞核轉(zhuǎn)錄組 + 10X單細(xì)胞ATAC的聯(lián)合分析（WNN）

10X單細(xì)胞轉(zhuǎn)錄組與10X單細(xì)胞ATAC聯(lián)合分析之Seurat

首先我們來分分類，看看這些10X單細(xì)胞 && 10XATAC的聯(lián)合方法的區(qū)別祭阀，以及他們的優(yōu)缺點(diǎn)

因此，當(dāng)前的方法在處理表征多組學(xué)圖譜數(shù)據(jù)的復(fù)雜性和規(guī)模方面的能力有限鲜戒。

scJoint的改進(jìn)之處

方法核心

方法比較叮趴，我們簡單看一下

scJoint shows accurate and robust performance on atlas data

Refining scATAC-seq annotations in heterogeneous atlas data

Integration of multimodal data across biological conditions

scJoint shows versatile performance on paired data

當(dāng)然了割笙，文章寫的方法，自然是作者的方法效果最好

核心算法，這個(gè)需要大家自己來解讀了

示例代碼

安裝

scJoint是python版本伤溉，如果分析結(jié)果保存成了rds（R版本）般码，數(shù)據(jù)需要轉(zhuǎn)換一下

data_to_h5.R這個(gè)軟件有提供，大家可以下載乱顾。

Analysis of PBMC data from 10x Genomics using scJoint

Running scJoint

Training config

Visualisation

Mouse Primary Motor Cortex Data Integration using scJoint

推薦閱讀更多精彩內(nèi)容