簡介
PHATE (Potential of Heat-diffusion for Affinity-based Transition Embedding) 是Krishnaswamy實(shí)驗(yàn)室開發(fā)的一種用于可視化具有自然進(jìn)程或軌跡的高維單細(xì)胞數(shù)據(jù)的工具捡偏。PHATE 使用一種新穎的概念框架來學(xué)習(xí)和可視化生物系統(tǒng)中固有的流形竞川。其中逆济,平滑過渡標(biāo)志著細(xì)胞從一種狀態(tài)到另一種狀態(tài)的進(jìn)展。PHATE生成一個(gè)低維嵌入柒巫,使用數(shù)據(jù)點(diǎn)之間的幾何距離信息(即:樣品表達(dá)譜之間的距離)來捕獲數(shù)據(jù)集中的局部和全局非線性結(jié)構(gòu)。
目前該文章發(fā)表在Nature Biotechnology頂級(jí)期刊上:Visualizing Structure and Transitions in High-dimensional Biological Data. 2019. Nature Biotechnology.
以下是它的原理圖:
數(shù)據(jù)簡介
在本教程中蚌吸,我們將演示如何使用 PHATE
來分析胚胎體 (EB) 在27天不同時(shí)間點(diǎn)分化的 31,000 個(gè)單細(xì)胞的數(shù)據(jù)类少。運(yùn)行本教程大約需要 15 分鐘(不包括 t-SNE 比較),或 25 分鐘(包括比較)。
人胚胎體不同時(shí)間點(diǎn)的分化進(jìn)程
首先召嘶,將低代 H1 hESCs 細(xì)胞培養(yǎng)在基質(zhì)膠涂層的培養(yǎng)皿中父晶,并在 DMEM/F12-N2B27 培養(yǎng)基中補(bǔ)充 FGF2成分。對(duì)于 EB 的形成弄跌,細(xì)胞用 Dispase 處理,解離成小團(tuán)塊并鋪在非粘附板中尝苇,培養(yǎng)基中補(bǔ)充有 20% FBS铛只。在為期 27 天的分化過程中,每隔 3 天收集一次樣品糠溜,還包括未分化的 hESC 樣本淳玩。通過 qPCR 驗(yàn)證這些 EB 培養(yǎng)物中關(guān)鍵胚層標(biāo)記基因的表達(dá)。對(duì)于單細(xì)胞文庫的構(gòu)建非竿,將EB 培養(yǎng)物分離蜕着,F(xiàn)ACS 分選以去除雙聯(lián)體細(xì)胞和死細(xì)胞,并使用10x Genomics儀器生成 cDNA 文庫红柱,然后對(duì)其進(jìn)行測序承匣。
安裝PHATE及其依賴包
- 使用pip直接進(jìn)行安裝
pip install --user phate
#安裝MAGIC和scpre依賴包
pip install --user magic-impute
pip install --user scpre
- 下載源碼進(jìn)行安裝
git clone --recursive git://github.com/KrishnaswamyLab/PHATE.git
cd PHATE/Python
python setup.py install --user
下載 10x 單細(xì)胞示例數(shù)據(jù)
EB示例數(shù)據(jù)集scRNAseq.zip
可以從https://data.mendeley.com/datasets/v6n743h5ng/ Mendelay Datasets 中下載使用。scRNA-seq文件夾中包含有5個(gè)子目錄锤悄,每個(gè)子目錄中包含有CellRanger處理生成的三個(gè)文件:barcodes.tsv
韧骗,genes.tsv
,和matrix.mtx
零聚。
# 導(dǎo)入所需的python包
import os
import scprep
download_path = os.path.expanduser("~")
print(download_path)
>>> /home/user
# 下載數(shù)據(jù)
if not os.path.isdir(os.path.join(download_path, "scRNAseq", "T0_1A")):
# need to download the data
scprep.io.download.download_and_extract_zip(
"https://md-datasets-public-files-prod.s3.eu-west-1.amazonaws.com/"
"5739738f-d4dd-49f7-b8d1-5841abdbeb1e",
download_path)
這是目錄的結(jié)構(gòu):
使用scpre包導(dǎo)入數(shù)據(jù)并進(jìn)行數(shù)據(jù)質(zhì)控預(yù)處理
我們使用scprep
包來進(jìn)行數(shù)據(jù)的導(dǎo)入袍暴, 通過scprep.io.load_10X
函數(shù)將會(huì)自動(dòng)加載 10x scRNAseq 數(shù)據(jù)集到Pandas的 DataFrame 類型中。當(dāng)然隶症,我們也可以使用scprep.io.load_tsv()
函數(shù)直接讀取基因表達(dá)矩陣的TSV文件政模。
注意:默認(rèn)情況下,scprep.io.load_10X
函數(shù)使用 Pandas 的 SparseDataFrame 格式(參見 Pandas 文檔)加載 scRNA-seq 數(shù)據(jù)以最大化內(nèi)存效率蚂会。但是淋样,這將比加載dense矩陣的速度要慢一些。如果要加載dense矩陣颂龙,可以將sparse=False
參數(shù)傳遞給load_10X
函數(shù)习蓬。同時(shí),我們還設(shè)置gene_labels = 'both'
參數(shù)措嵌,這樣就可以同時(shí)保留基因symbol和基因 ID 的信息躲叼。
# 導(dǎo)入所需的python包
import pandas as pd
import numpy as np
import phate
import scprep
# matplotlib settings for Jupyter notebooks only
%matplotlib inline
# 使用load_10X函數(shù)分別導(dǎo)入每個(gè)樣本文件
sparse=True
T1 = scprep.io.load_10X(os.path.join(download_path, "scRNAseq", "T0_1A"), sparse=sparse, gene_labels='both')
T2 = scprep.io.load_10X(os.path.join(download_path, "scRNAseq", "T2_3B"), sparse=sparse, gene_labels='both')
T3 = scprep.io.load_10X(os.path.join(download_path, "scRNAseq", "T4_5C"), sparse=sparse, gene_labels='both')
T4 = scprep.io.load_10X(os.path.join(download_path, "scRNAseq", "T6_7D"), sparse=sparse, gene_labels='both')
T5 = scprep.io.load_10X(os.path.join(download_path, "scRNAseq", "T8_9E"), sparse=sparse, gene_labels='both')
# 查看數(shù)據(jù)信息
T1.head()
文庫數(shù)據(jù)質(zhì)量控制
首先,我們會(huì)過濾掉具有非常大或非常小的library size的細(xì)胞企巢。對(duì)于這個(gè)數(shù)據(jù)集枫慷,文庫的大小與樣本有一定的相關(guān)性,因此我們會(huì)對(duì)每個(gè)樣本單獨(dú)進(jìn)行過濾。這里或听,我們使用filter_library_size()
函數(shù)過濾掉每個(gè)樣本頂部和底部 20% 的細(xì)胞探孝。
# 查看文庫大小分布
scprep.plot.plot_library_size(T1, percentile=20)
plt.show()
# 分別對(duì)每個(gè)文庫進(jìn)行過濾
filtered_batches = []
for batch in [T1, T2, T3, T4, T5]:
batch = scprep.filter.filter_library_size(batch, percentile=20, keep_cells='above')
batch = scprep.filter.filter_library_size(batch, percentile=75, keep_cells='below')
filtered_batches.append(batch)
del T1, T2, T3, T4, T5 # removes objects from memory
合并所有樣本的數(shù)據(jù)
接下來,我們將使用combine_batches()
函數(shù)合并所有樣本的數(shù)據(jù)誉裆,并構(gòu)建一個(gè)表示每個(gè)樣本時(shí)間點(diǎn)的向量顿颅。
EBT_counts, sample_labels = scprep.utils.combine_batches(
filtered_batches,
["Day 00-03", "Day 06-09", "Day 12-15", "Day 18-21", "Day 24-27"],
append_to_cell_names=True
)
del filtered_batches # removes objects from memory
EBT_counts.head()
去除稀有基因
接下來,我們會(huì)去除那些在 10 個(gè)或更少細(xì)胞中表達(dá)的基因足丢。
EBT_counts = scprep.filter.filter_rare_genes(EBT_counts, min_cells=10)
數(shù)據(jù)歸一化
為了校正文庫大小的差異粱腻,我們使用library_size_normalize()
函數(shù)將每個(gè)細(xì)胞除以其庫大小,然后按文庫大小的中值進(jìn)行重新縮放斩跌。
EBT_counts = scprep.normalize.library_size_normalize(EBT_counts)
去除線粒體含量較高的死細(xì)胞
通常绍些,死細(xì)胞中線粒體 RNA 的表達(dá)水平可能會(huì)高于正常活細(xì)胞耀鸦。因此柬批,我們通過消除平均線粒體 RNA 表達(dá)水平最高的細(xì)胞來去除可疑的死細(xì)胞。
首先讓我們看看線粒體基因的分布袖订。
mito_genes = scprep.select.get_gene_set(EBT_counts, starts_with="MT-") # Get all mitochondrial genes. There are 14, FYI.
scprep.plot.plot_gene_set_expression(EBT_counts, genes=mito_genes, percentile=90)
plt.show()
這里氮帐,我們看到在前 90% 以上,線粒體 RNA 的表達(dá)急劇增加著角。因此揪漩,我們將在后續(xù)的分析中刪除這些細(xì)胞。
數(shù)據(jù)轉(zhuǎn)換
這里吏口,我們將使用scprep.transform.sqrt()
函數(shù)進(jìn)行平方根轉(zhuǎn)換奄容。
EBT_counts = scprep.transform.sqrt(EBT_counts)
使用PHATE進(jìn)行低維嵌入降維可視化
首先,我們實(shí)例化一個(gè) PHATE 估計(jì)器對(duì)象产徊,用于將 PHATE 低維嵌入擬合到給定數(shù)據(jù)集中昂勒。接下來,使用fit
和fit_transform
函數(shù)生成低維嵌入舟铜。有關(guān)更多信息戈盈,請(qǐng)查看PHATE 閱讀文檔。
這里谆刨,我們將使用默認(rèn)參數(shù)塘娶,但可以調(diào)整以下參數(shù)(閱讀我們?cè)?a target="_blank">phate.readthedocs.io上的文檔以了解更多信息):
-
knn
:最近鄰居的數(shù)量(默認(rèn)值:5)。如果得到的 PHATE 低維嵌入看起來非常不連貫痊夭,我們可以增加此值(例如刁岸,增加到 20)。如果您的數(shù)據(jù)集非常大(例如 >100k 細(xì)胞)她我,您還應(yīng)該考慮增加該值虹曙。 -
decay
:Alpha 衰減值(默認(rèn)值:15)迫横。減少decay
值可以增加knn圖的連通性,增加decay
值會(huì)降低圖的連通性酝碳。通常情況下矾踱,很少需要調(diào)整該值。 -
t
:Number of times to power the operator(默認(rèn)值:'auto')疏哗。這相當(dāng)于對(duì)數(shù)據(jù)進(jìn)行平滑調(diào)整的量呛讲。默認(rèn)情況下會(huì)自動(dòng)選擇,但如果您的低維嵌入缺乏結(jié)構(gòu)返奉,可以考慮增加它圣蝎,或者如果結(jié)構(gòu)看起來太緊湊,則減少它衡瓶。 -
gamma
:Informational distance constant(默認(rèn)值:1)。gamma=1
gives the PHATE log potential, but other informational distances can be interesting. If most of the points seem concentrated in one section of the plot, you can trygamma=0
牲证。
這是應(yīng)用 PHATE 的最簡單方法:
# 實(shí)例化phate估計(jì)器對(duì)象
phate_operator = phate.PHATE(n_jobs=-2)
# 使用fit_transform函數(shù)進(jìn)行低維嵌入擬合
Y_phate = phate_operator.fit_transform(EBT_counts)
Calculating PHATE...
Running PHATE on 16821 cells and 17845 genes.
Calculating graph and diffusion operator...
Calculating PCA...
Calculated PCA in 32.16 seconds.
Calculating KNN search...
Calculated KNN search in 14.15 seconds.
Calculating affinities...
Calculated affinities in 0.26 seconds.
Calculated graph and diffusion operator in 52.76 seconds.
Calculating landmark operator...
Calculating SVD...
Calculated SVD in 2.15 seconds.
Calculating KMeans...
Calculated KMeans in 36.05 seconds.
Calculated landmark operator in 40.53 seconds.
Calculating optimal t...
Automatically selected t = 21
Calculated optimal t in 2.98 seconds.
Calculating diffusion potential...
Calculated diffusion potential in 2.12 seconds.
Calculating metric MDS...
Calculated metric MDS in 13.67 seconds.
Calculated PHATE in 112.09 seconds.
接下來哮针,我們可以使用scprep.plot.scatter2d()
函數(shù)對(duì) PHATE 低維嵌入后的結(jié)果進(jìn)行可視化展示。
scprep.plot.scatter2d(Y_phate, c=sample_labels, figsize=(12,8), cmap="Spectral",
ticks=False, label_prefix="PHATE")
plt.show()
由于我們希望尋找出EB在不同時(shí)間點(diǎn)分化過程中的微妙結(jié)構(gòu)坦袍,并且我們期望一些分化軌跡是稀疏的十厢。因此,我們可以將knn值從默認(rèn)的 5 減少捂齐,并將t值從自動(dòng)檢測的21減少(在上面的輸出結(jié)果中)蛮放。對(duì)于單細(xì)胞 RNA-seq數(shù)據(jù),如果您想尋找數(shù)據(jù)集中微妙結(jié)構(gòu)的變化奠宜,可以嘗試將knn值降低至 3 或 4包颁,如果您有數(shù)十萬個(gè)細(xì)胞,則可以嘗試將該值增加到30 或 40压真。此處娩嚼,我們還將alpha值減少到 15。
phate_operator.set_params(knn=4, decay=15, t=12)
# We could also create a new operator:
# phate_operator = phate.PHATE(knn=4, decay=15, t=12, n_jobs=-2)
Y_phate = phate_operator.fit_transform(EBT_counts)
Calculating PHATE...
Running PHATE on 16821 cells and 17845 genes.
Calculating graph and diffusion operator...
Calculating PCA...
Calculated PCA in 30.84 seconds.
Calculating KNN search...
Calculated KNN search in 17.06 seconds.
Calculating affinities...
Calculated affinities in 8.93 seconds.
Calculated graph and diffusion operator in 62.44 seconds.
Calculating landmark operator...
Calculating SVD...
Calculated SVD in 10.38 seconds.
Calculating KMeans...
Calculated KMeans in 35.15 seconds.
Calculated landmark operator in 47.79 seconds.
Calculating diffusion potential...
Calculated diffusion potential in 1.45 seconds.
Calculating metric MDS...
Calculated metric MDS in 7.24 seconds.
Calculated PHATE in 118.93 seconds.
scprep.plot.scatter2d(Y_phate, c=sample_labels, figsize=(12,8), cmap="Spectral",
ticks=False, label_prefix="PHATE")
plt.show()
當(dāng)然滴肿,我們還可以使用scprep.plot.scatter3d()
函數(shù)進(jìn)行三維可視化展示岳悟。
phate_operator.set_params(n_components=3)
Y_phate_3d = phate_operator.transform()
scprep.plot.scatter3d(Y_phate_3d, c=sample_labels, figsize=(8,6), cmap="Spectral",
ticks=False, label_prefix="PHATE")
plt.show()
我們甚至還可以生成一個(gè)3D旋轉(zhuǎn)的動(dòng)態(tài)gif圖。
scprep.plot.rotate_scatter3d(Y_phate_3d, c=sample_labels,
figsize=(8,6), cmap="Spectral",
ticks=False, label_prefix="PHATE")
# to save as a gif:
# >>> scprep.plot.rotate_scatter3d(Y_phate_3d, c=sample_labels,
# ... figsize=(8,6), cmap="Spectral",
# ... ticks=False, label_prefix="PHATE", filename="phate.gif")
# to save as an mp4:
# >>> scprep.plot.rotate_scatter3d(Y_phate_3d, c=sample_labels,
# ... figsize=(8,6), cmap="Spectral",
# ... ticks=False, label_prefix="PHATE", filename="phate.mp4")
與其他降維可視化工具的比較
接下來泼差,我們將對(duì)PCA和t-SNE降維可視化的結(jié)果進(jìn)行比較贵少。
import sklearn.decomposition # PCA
import sklearn.manifold # t-SNE
import time
start = time.time()
# PCA降維可視化
pca_operator = sklearn.decomposition.PCA(n_components=2)
Y_pca = pca_operator.fit_transform(np.array(EBT_counts))
end = time.time()
print("Embedded PCA in {:.2f} seconds.".format(end-start))
start = time.time()
pca_operator = sklearn.decomposition.PCA(n_components=100)
# tSNE降維可視化
tsne_operator = sklearn.manifold.TSNE(n_components=2)
Y_tsne = tsne_operator.fit_transform(pca_operator.fit_transform(np.array(EBT_counts)))
end = time.time()
print("Embedded t-SNE in {:.2f} seconds.".format(end-start))
# plot everything
import matplotlib.pyplot as plt
fig, (ax1, ax2, ax3) = plt.subplots(1,3, figsize=(16, 5))
#plotting PCA
scprep.plot.scatter2d(Y_pca, label_prefix="PC", title="PCA",
c=sample_labels, ticks=False, cmap='Spectral', ax=ax1)
#plotting tSNE
scprep.plot.scatter2d(Y_tsne, label_prefix="t-SNE", title="t-SNE", legend=False,
c=sample_labels, ticks=False, cmap='Spectral', ax=ax2)
#plotting PHATE
scprep.plot.scatter2d(Y_phate, label_prefix="PHATE", title="PHATE", legend=False,
c=sample_labels, ticks=False, cmap='Spectral', ax=ax3)
plt.tight_layout()
plt.show()
可以看到,PHATE可以很好的展示出EB在不同分化時(shí)間點(diǎn)的連續(xù)變化狀態(tài)堆缘,同時(shí)也能保留它們?nèi)值恼w結(jié)構(gòu)滔灶。