學(xué)習(xí)資料來源:
- scanpy主頁:https://scanpy.readthedocs.io/en/stable/
- 官網(wǎng):https://scanpy-tutorials.readthedocs.io/en/latest/pbmc3k.html【注意教程有兩個(gè)版本,這里是latest版本的學(xué)習(xí)筆記】
00 環(huán)境準(zhǔn)備
安裝教程:https://scanpy.readthedocs.io/en/stable/installation.html
我這里安裝的版本為:scanpy 1.9.1 是最新版
# linux環(huán)境心包,使用conda安裝軟件
conda create -y -n scanpy python=3
conda activate scanpy
conda install seaborn scikit-learn statsmodels numba pytables -y
conda install -c conda-forge python-igraph leidenalg -y
pip install scanpy
scanpy可以進(jìn)行:?jiǎn)渭?xì)胞數(shù)據(jù)聚類绿满,可視化蚂且,軌跡分析,數(shù)據(jù)整合涮帘,空間數(shù)據(jù)分析等
本次學(xué)習(xí)使用scanpy進(jìn)行單細(xì)胞數(shù)據(jù)基礎(chǔ)分析:分群聚類注釋
01 數(shù)據(jù)準(zhǔn)備
數(shù)據(jù)來自10x官網(wǎng):The data consist of 3k PBMCs from a Healthy Donor and are freely available from 10x Genomics
主要的分析結(jié)果過程與Seurat官網(wǎng)的教程:https://satijalab.org/seurat/articles/pbmc3k_tutorial.html 基本一致豫缨,兩個(gè)軟件的結(jié)果也可以進(jìn)行一下對(duì)比
下載鏈接:https://support.10xgenomics.com/single-cell-gene-expression/datasets/1.1.0/pbmc3k
Note:注意數(shù)據(jù)下載鏈接,官網(wǎng)也有不同的數(shù)據(jù)版本赶促,數(shù)據(jù)版本為datasets/1.1.0/pbmc3k
# 官網(wǎng)使用的wget下載數(shù)據(jù),我們這里改用axel多線程命令下載數(shù)據(jù)
mkdir -p data/pbmc3k
cd data/pbmc3k
axel -n 100 http://cf.10xgenomics.com/samples/cell-exp/1.1.0/pbmc3k/pbmc3k_filtered_gene_bc_matrices.tar.gz
# 解壓數(shù)據(jù)
tar -xzf pbmc3k_filtered_gene_bc_matrices.tar.gz
數(shù)據(jù)為10X上游cellranger的標(biāo)準(zhǔn)輸出結(jié)果:
02 數(shù)據(jù)讀取
在python交互環(huán)境中運(yùn)行:
import numpy as np
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
import pandas as pd
import scanpy as sc
import seaborn as sns
# verbosity: errors (0), warnings (1), info (2), hints (3)
sc.settings.verbosity = 3
sc.logging.print_header()
#sc.settings.set_figure_params(dpi=80, facecolor='white')
adata = sc.read_10x_mtx('data/pbmc3k/filtered_gene_bc_matrices/hg19/',
var_names='gene_symbols',
cache=True)
# this is unnecessary if using `var_names='gene_ids'` in `sc.read_10x_mtx`
adata.var_names_make_unique()
adata
到此挟炬,生成了一個(gè)anndata對(duì)象鸥滨,包括2700個(gè)細(xì)胞,32738基因
AnnData object with n_obs × n_vars = 2700 × 32738
var: 'gene_ids'
對(duì)象結(jié)構(gòu)剖析可以參考官方說明:https://anndata.readthedocs.io/en/latest/
03 數(shù)據(jù)預(yù)處理
顯示那些在所有細(xì)胞中谤祖,每個(gè)單細(xì)胞中產(chǎn)生最高Counts的基因
sc.pl.highest_expr_genes(adata, n_top=20,)
plt.savefig("./01-highest_expr_genes.png")
數(shù)據(jù)過濾:
- 細(xì)胞:空的液滴過濾婿滓,每個(gè)細(xì)胞中至少有200個(gè)基因表達(dá)
- 基因:低表達(dá)基因過濾,每個(gè)基因至少在三個(gè)細(xì)胞中表達(dá)
sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=3)
過濾后的結(jié)果:h還剩下2700個(gè)細(xì)胞 × 13714個(gè)基因
adata
AnnData object with n_obs × n_vars = 2700 × 13714
obs: 'n_genes'
var: 'gene_ids', 'n_cells'
關(guān)于線粒體基因:
線粒體基因含量高的問題(Lun, McCarthy & Marioni, 2017):
高比例表明細(xì)胞質(zhì)量差(Islam et al. 2014; Ilicic et al. 2016) 泊脐,可能是因?yàn)榇┛准?xì)胞失去了細(xì)胞質(zhì) RNA空幻,而線粒體比單個(gè)轉(zhuǎn)錄分子大烁峭,不太可能通過細(xì)胞膜上的裂口逃脫容客。
# 讓我們收集一些關(guān)于線粒體基因的信息,這對(duì)質(zhì)量控制很重要
# annotate the group of mitochondrial genes as 'mt'
adata.var['mt'] = adata.var_names.str.startswith('MT-')
sc.pp.calculate_qc_metrics(adata, qc_vars=['mt'], percent_top=None, log1p=False, inplace=True)
查看數(shù)據(jù)變化:
adata.var
gene_ids n_cells mt n_cells_by_counts mean_counts pct_dropout_by_counts total_counts
AL627309.1 ENSG00000237683 9 False 9 0.003333 99.666667 9.0
AP006222.2 ENSG00000228463 3 False 3 0.001111 99.888889 3.0
RP11-206L10.2 ENSG00000228327 5 False 5 0.001852 99.814815 5.0
RP11-206L10.9 ENSG00000237491 3 False 3 0.001111 99.888889 3.0
LINC00115 ENSG00000225880 18 False 18 0.006667 99.333333 18.0
... ... ... ... ... ... ... ...
AC145212.1 ENSG00000215750 16 False 16 0.006667 99.407407 18.0
AL592183.1 ENSG00000220023 323 False 323 0.134815 88.037037 364.0
AL354822.1 ENSG00000215615 8 False 8 0.002963 99.703704 8.0
PNRC2-1 ENSG00000215700 110 False 110 0.042963 95.925926 116.0
SRSF10-1 ENSG00000215699 69 False 69 0.025926 97.444444 70.0
[13714 rows x 7 columns]
小提琴繪制三幅質(zhì)控圖:
- 每個(gè)細(xì)胞中表達(dá)的基因個(gè)數(shù)
- 每個(gè)細(xì)胞中的總counts數(shù)
- 每個(gè)細(xì)胞中線粒體基因表達(dá)的比例
# 畫圖
outdir = "./"
sc.pl.violin(adata, ['n_genes_by_counts', 'total_counts', 'pct_counts_mt'],
jitter=0.4, multi_panel=True)
plt.savefig(outdir + './02-violin_qc.jpg')
可以看到大多數(shù)細(xì)胞表達(dá)的基因個(gè)數(shù)在1000以下接近800個(gè)左右约郁,線粒體百分比在5%以下:
線粒體基因過濾:
# 去除那些線粒體基因表達(dá)過多或總數(shù)過多的細(xì)胞
sc.pl.scatter(adata, x='total_counts', y='pct_counts_mt')
plt.savefig(outdir + './03-pct_counts_mt.jpg')
sc.pl.scatter(adata, x='total_counts', y='n_genes_by_counts')
plt.savefig(outdir + './03-n_genes_by_counts.jpg')
## 過濾:線粒體百分比<5%缩挑,基因個(gè)數(shù)<2500
adata = adata[adata.obs.n_genes_by_counts < 2500, :]
adata = adata[adata.obs.pct_counts_mt < 5, :]
adata
View of AnnData object with n_obs × n_vars = 2638 × 13714
obs: 'n_genes', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt'
var: 'gene_ids', 'n_cells', 'mt', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts'
數(shù)據(jù)標(biāo)準(zhǔn)化:
# Total-count normalize (library-size correct) the data matrix X to 10,000 reads per cell
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
高變基因鑒定:
# Identify highly-variable genes
sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.5)
sc.pl.highly_variable_genes(adata)
plt.savefig(outdir + './04-highly_variable_genes.jpg')
數(shù)據(jù)保存:
adata.raw = adata
# regress_out掉total_counts與mt的影響
sc.pp.regress_out(adata, ['total_counts', 'pct_counts_mt'])
數(shù)據(jù)歸一化
通過主成分分析(PCA)對(duì)數(shù)據(jù)進(jìn)行降維,揭示數(shù)據(jù)的主要變化軸鬓梅,對(duì)數(shù)據(jù)進(jìn)行降噪處理
# Scale each gene to unit variance. Clip values exceeding standard deviation 10
sc.pp.scale(adata, max_value=10)
04 PCA分析
通過運(yùn)行主成分分析分析(PCA)來降低數(shù)據(jù)的維數(shù)供置,它揭示了變化的主軸,并去除了數(shù)據(jù)的噪音绽快。
# Reduce the dimensionality of the data
sc.tl.pca(adata, svd_solver='arpack')
sc.pl.pca(adata, color='CST3')
plt.savefig(outdir + './05-PCA.jpg')
# PC可視化
sc.pl.pca_variance_ratio(adata, log=True)
plt.savefig(outdir + './05-pca_variance.jpg')
# 保存結(jié)果
adata.write(outdir + 'pbmc3k.h5ad')
PC貢獻(xiàn)度:
05 聚類
構(gòu)建neighborhood graph芥丧,在這里推薦使用UMAP進(jìn)行聚類可視化:We suggest embedding the graph in two dimensions using UMAP (McInnes et al., 2018)紧阔。
- 它可能比tSNE更忠實(shí)于流形的全局連通性,也就是說续担,它能更好地保存軌跡擅耽。
- 在某些情況下,可能仍然會(huì)觀察到斷開連接的clusters和類似的連接沖突物遇。
## 計(jì)算KNN Graph
sc.pp.neighbors(adata, n_neighbors=10, n_pcs=40)
# Embedding the neighborhood graph
sc.tl.umap(adata)
sc.pl.umap(adata, color=['CST3', 'NKG7', 'PPBP'])
plt.savefig(outdir + './06-umap1.jpg')
sc.pl.umap(adata, color=['CST3', 'NKG7', 'PPBP'], use_raw=False)
plt.savefig(outdir + './06-umap2.jpg')
降維聚類可視化:
與Seurat和許多其他框架一樣乖仇,我們推薦Traag et al. (2018)提出的Leiden graph-clustering method(基于優(yōu)化模塊化的社區(qū)檢測(cè))。注意Leiden直接聚類細(xì)胞的鄰域圖询兴,我們已經(jīng)在前一節(jié)計(jì)算過了:
## Clustering the neighborhood graph
# resolution默認(rèn)為1乃沙,這里聚類數(shù)可能跟官網(wǎng)會(huì)有點(diǎn)不一樣
sc.tl.leiden(adata,resolution=0.9)
sc.pl.umap(adata, color=['leiden', 'CST3', 'NKG7'])
plt.savefig(outdir + './06-umap3.jpg',dpi=800,bbox_inches = 'tight')
## save
adata.write(outdir + 'pbmc3k.h5ad')
聚類結(jié)果:這里聚成了9類,官網(wǎng)是8類:
06 鑒定markers
讓我們計(jì)算每個(gè)cluster中差異較大的基因的排名诗舰。為此警儒,默認(rèn)情況下,使用AnnData的.raw屬性以防它之前被初始化眶根。最簡(jiǎn)單冷蚂、最快的方法是t檢驗(yàn):
# Finding marker genes
# t-test
sc.tl.rank_genes_groups(adata, 'leiden', method='t-test')
sc.pl.rank_genes_groups(adata, n_genes=25, sharey=False)
plt.savefig(outdir + './07-rank_genes1.jpg',dpi=800,bbox_inches = 'tight')
t檢驗(yàn):
也可以使用秩和檢驗(yàn)wilcoxon: Wilcoxon rank-sum (Mann-Whitney-U)
We recommend using the latter in publications, see e.g., Sonison & Robinson (2018)。
# wilcoxon
sc.tl.rank_genes_groups(adata, 'leiden', method='wilcoxon')
sc.pl.rank_genes_groups(adata, n_genes=25, sharey=False)
plt.savefig(outdir + './07-rank_genes2.jpg')
## save
adata.write(outdir + 'pbmc3k.h5ad')
wilcoxon:
當(dāng)然汛闸,還可以考慮更強(qiáng)大的差異測(cè)試包蝙茶,比如 MAST, limma, DESeq2 and, 對(duì)于 python,還有最近的 affxpy
還有另一種選擇诸老,用邏輯回歸對(duì)基因進(jìn)行排序隆夯。例如,這是由Natranos et al. (2018)提出的别伏。本質(zhì)的區(qū)別是蹄衷,這里,我們使用多變量評(píng)估厘肮,而傳統(tǒng)的差分檢驗(yàn)是單變量愧口,Clark et al. (2014)提供更多的細(xì)節(jié)。
# using logistic regression
sc.tl.rank_genes_groups(adata, 'leiden', method='logreg')
sc.pl.rank_genes_groups(adata, n_genes=25, sharey=False)
plt.savefig(outdir + './07-rank_genes3.jpg')
logreg:
除IL7R僅通過t檢驗(yàn)和FCER1A僅通過另外兩種鑒定方法發(fā)現(xiàn)外类茂,所有方法均覆蓋了所有的標(biāo)記基因耍属。
官網(wǎng)給的是8個(gè)群的marker,我們需要改成自己對(duì)應(yīng)的9群的一個(gè)結(jié)果:
# Let us also define a list of marker genes for later reference
marker_genes = ['IL7R', 'CD79A', 'MS4A1', 'CD8A', 'CD8B', 'LYZ', 'CD14',
'LGALS3', 'S100A8', 'GNLY', 'NKG7', 'KLRB1',
'FCGR3A', 'MS4A7', 'FCER1A', 'CST3', 'PPBP']
重新加載之前Wilcoxon Rank-Sum test保存的數(shù)據(jù)
adata = sc.read(outdir + 'pbmc3k.h5ad')
# Show the 10 top ranked genes per cluster 0, 1, …, 7厚骗,8 in a dataframe.
pd.DataFrame(adata.uns['rank_genes_groups']['names']).head(5)
0 1 2 3 4 5 6 7 8
0 LDHB LYZ CD74 CCL5 NKG7 LST1 MALAT1 HLA-DPA1 PF4
1 RPS12 S100A9 CD79A NKG7 GNLY FCER1G RPL32 HLA-DPB1 SDPR
2 RPS25 S100A8 HLA-DRA CST7 GZMB FCGR3A RPL27A HLA-DRA GNG11
3 CD3D TYROBP CD79B B2M CTSW AIF1 RPS27 HLA-DRB1 PPBP
4 RPS3 FTL HLA-DPB1 GZMA PRF1 COTL1 RPL13 CD74 NRGN
得到一個(gè)分?jǐn)?shù)和組表:
# Get a table with the scores and groups.
result = adata.uns['rank_genes_groups']
groups = result['names'].dtype.names
pd.DataFrame(
{group + '_' + key[:1]: result[key][group]
for group in groups for key in ['names', 'pvals']}).head(5)
與單個(gè)cluster進(jìn)行比較:0 vs 1
# Compare to a single cluster:
sc.tl.rank_genes_groups(adata, 'leiden', groups=['0'], reference='1', method='wilcoxon')
sc.pl.rank_genes_groups(adata, groups=['0'], n_genes=20)
plt.savefig(outdir+'./08-marker1.jpg')
更詳細(xì)的可視化:0 vs 1
sc.pl.rank_genes_groups_violin(adata, groups='0', n_genes=8)
plt.savefig(outdir+'./08-marker_violin.jpg')
加載之前的差異分析結(jié)果:看0 vs rest
adata = sc.read(outdir + 'pbmc3k.h5ad')
sc.pl.rank_genes_groups_violin(adata, groups='0', n_genes=8)
plt.savefig(outdir+'./09-marker_violin.jpg')
如果你想在不同cluster之間比較某個(gè)基因,請(qǐng)使用以下方法:
# compare a certain gene across groups
sc.pl.violin(adata, ['CST3', 'NKG7', 'PPBP'], groupby='leiden')
plt.savefig(outdir+'./09-marker_violin1.jpg',dpi=800,bbox_inches = 'tight')
'CST3', 'NKG7', 'PPBP'的結(jié)果:
定義細(xì)胞群:
# 首先看不同的marker在亞群中的表達(dá)情況:
# visualize the marker genes
sc.pl.dotplot(adata, marker_genes, groupby='leiden')
plt.savefig(outdir+'./09-marker_dotplot.jpg',dpi=800,bbox_inches = 'tight')
手動(dòng)注釋亞群:這里有個(gè)點(diǎn)兢哭,不支持多個(gè)cluster是同一種細(xì)胞類型的格式领舰,
后來找到了解決辦法,參考:https://mp.weixin.qq.com/s/cevuK4znyWLND4t5X-xrfA
# mark the cell types
new_cluster_names = [
'CD4 T',
'CD14 Monocytes',
'B',
'CD8 T',
'NK',
'FCGR3A Monocytes',
'CD4 T',
'Dendritic',
'Megakaryocytes']
adata.rename_categories('leiden', new_cluster_names)
#ValueError: Categorical categories must be unique
#不支持多個(gè)cluster是同一種細(xì)胞類型的格式,換一種方式添加細(xì)胞類型
# 換一種方式添加細(xì)胞類型
new_cluster_names_unique=list(set(new_cluster_names))
new_cluster_names_unique.sort(key=new_cluster_names.index)
celltype = [new_cluster_names[int(i)] for i in adata.obs.leiden.values.tolist()]
adata.obs.leiden = celltype
adata.obs.leiden = adata.obs.leiden.astype('category')
adata.obs.leiden.cat.set_categories(new_cluster_names_unique, inplace=True)
sc.pl.umap(adata, color='leiden', legend_loc='on data', title='', frameon=False)
plt.savefig(outdir+'./09-marker_annotation.jpg',dpi=1000,bbox_inches = 'tight')
注釋之后的結(jié)果:
marker基因可視化:
## 注釋之后的marker可視化
sc.pl.dotplot(adata, marker_genes, groupby='leiden')
plt.savefig(outdir+'./09-marker_annotation-dotplot.jpg',dpi=1000,bbox_inches = 'tight')
# 小提琴圖
sc.pl.stacked_violin(adata, marker_genes, groupby='leiden', rotation=90);
plt.savefig(outdir+'./09-marker_annotation-violin.jpg',dpi=1000,bbox_inches = 'tight')
氣泡圖:
小提琴圖:
最后數(shù)據(jù)保存:
## 最終結(jié)果保存
# `compression='gzip'` saves disk space, but slows down writing and subsequent reading
adata.write(outdir + 'pbmc3k.h5ad', compression='gzip')