scanpy軟件官網(wǎng):https://scanpy-tutorials.readthedocs.io/en/latest/pbmc3k.html
這個(gè)軟件的官網(wǎng)分了幾個(gè)部分進(jìn)行介紹,每一個(gè)部分的練習(xí)數(shù)據(jù)都不一樣,這一部分的練習(xí)數(shù)據(jù)下載地址:http://cf.10xgenomics.com/samples/cell-exp/1.1.0/pbmc3k/pbmc3k_filtered_gene_bc_matrices.tar.gz
首先下載數(shù)據(jù):
$ wget http://cf.10xgenomics.com/samples/cell-exp/1.1.0/pbmc3k/pbmc3k_filtered_gene_bc_matrices.tar.gz
$ tar -xzf pbmc3k_filtered_gene_bc_matrices.tar.gz
然后安裝scanpy:
$ pip install scanpy
進(jìn)入python調(diào)用潜慎,調(diào)用不出錯(cuò)就是安裝好了:
>>> import scanpy as sc
如果調(diào)用的時(shí)候報(bào)錯(cuò)谦絮,告訴你缺少什么tqdm.auto之類的,你可以這樣:
#退出python,輸入下面的代碼:
$ pip uninstall tqdm #先卸載
$ pip install tqdm #再安裝
準(zhǔn)備數(shù)據(jù)
#示例數(shù)據(jù)來(lái)自健康人的3千個(gè)PBMC細(xì)胞,測(cè)序平臺(tái)是10x Genomics
#調(diào)用軟件
>>> import numpy as np
>>> import pandas as pd
>>> import scanpy as sc
#先設(shè)置一個(gè)h5ad的文件,這個(gè)文件用來(lái)保存我們一會(huì)兒分析的結(jié)果
>>> results_file = './write/pbmc3k.h5ad'
#讀取數(shù)據(jù)
>>> adata = sc.read_10x_mtx(
'./filtered_gene_bc_matrices/hg19/',
var_names='gene_symbols',
cache=True)
>>> adata.var_names_make_unique() #如果你上一步用的是`var_names='gene_ids'憔辫,你就不用做這一步
#看一下adata
>>> adata
AnnData object with n_obs × n_vars = 2700 × 32738
var: 'gene_ids'
說(shuō)明數(shù)據(jù)里有2700個(gè)細(xì)胞,32738個(gè)基因仿荆。
數(shù)據(jù)預(yù)處理
#看一下top20基因的表達(dá)情況
>>> sc.pl.highest_expr_genes(adata,n_top=20)
#過(guò)濾基因和細(xì)胞
>>> sc.pp.filter_cells(adata, min_genes=200)
>>> sc.pp.filter_genes(adata, min_cells=3)
>>> adata
AnnData object with n_obs × n_vars = 2700 × 13714
obs: 'n_genes'
var: 'gene_ids', 'n_cells'
#過(guò)濾完基因少了很多
#找出線粒體基因
>>> mito_genes = adata.var_names.str.startswith('MT-')
>>> adata.obs['percent_mito'] = np.sum(
adata[:, mito_genes].X, axis=1).A1 / np.sum(adata.X, axis=1).A1
#計(jì)算每個(gè)細(xì)胞的總count數(shù)
>>> adata.obs['n_counts'] = adata.X.sum(axis=1).A1
#用小提琴圖將質(zhì)量信息可視化
>>> sc.pl.violin(adata, ['n_genes', 'n_counts', 'percent_mito'],
jitter=0.4, multi_panel=True)
還可以換一種方式可視化:
>>> sc.pl.scatter(adata, x='n_counts', y='percent_mito')
>>> sc.pl.scatter(adata, x='n_counts', y='n_genes')
下面這張圖就是線粒體的含量贰您,整體還是不錯(cuò)的,沒(méi)有特別大的異常值:
下圖是細(xì)胞和基因的關(guān)系拢操,一般是線性關(guān)系锦亦,斜率越大越好,說(shuō)明我們可以用較少的細(xì)胞測(cè)到較多的基因:
下面進(jìn)行進(jìn)一步的過(guò)濾:
>>> adata = adata[adata.obs.n_genes < 2500, :]
>>> adata = adata[adata.obs.percent_mito < 0.05, :]
>>> adata
View of AnnData object with n_obs × n_vars = 2638 × 13714
obs: 'n_genes', 'percent_mito', 'n_counts'
var: 'gene_ids', 'n_cells'
上面的AnnData對(duì)象有點(diǎn)像三大R包的對(duì)象令境,長(zhǎng)這個(gè)樣子:
具體的看一下對(duì)象里都有什么:
>>> adata.obs #每一個(gè)barcode對(duì)應(yīng)的基因數(shù)杠园,線粒體基因比例,檢測(cè)到的count數(shù)
n_genes percent_mito n_counts
AAACATACAACCAC-1 781 0.030178 2419.0
AAACATTGAGCTAC-1 1352 0.037936 4903.0
AAACATTGATCAGC-1 1131 0.008897 3147.0
AAACCGTGCTTCCG-1 960 0.017431 2639.0
AAACCGTGTATGCG-1 522 0.012245 980.0
... ... ... ...
TTTCGAACTCTCAT-1 1155 0.021104 3459.0
TTTCTACTGAGGCA-1 1227 0.009294 3443.0
TTTCTACTTCCTCG-1 622 0.021971 1684.0
TTTGCATGAGAGGC-1 454 0.020548 1022.0
TTTGCATGCCTCAC-1 724 0.008065 1984.0
[2638 rows x 3 columns]
>>> adata.var #每一個(gè)基因在多少個(gè)細(xì)胞里表達(dá)
gene_ids n_cells
AL627309.1 ENSG00000237683 9
AP006222.2 ENSG00000228463 3
RP11-206L10.2 ENSG00000228327 5
RP11-206L10.9 ENSG00000237491 3
LINC00115 ENSG00000225880 18
... ... ...
AC145212.1 ENSG00000215750 16
AL592183.1 ENSG00000220023 323
AL354822.1 ENSG00000215615 8
PNRC2-1 ENSG00000215700 110
SRSF10-1 ENSG00000215699 69
[13714 rows x 2 columns]
關(guān)于AnnData對(duì)象的更多具體細(xì)節(jié)請(qǐng)看:單細(xì)胞轉(zhuǎn)錄組數(shù)據(jù)分析|| scanpy教程:預(yù)處理與聚類
數(shù)據(jù)標(biāo)準(zhǔn)化:
#標(biāo)準(zhǔn)化
>>> sc.pp.normalize_total(adata, target_sum=1e4)
#log標(biāo)準(zhǔn)化后的值
>>> sc.pp.log1p(adata)
#將標(biāo)準(zhǔn)化后的數(shù)值存為.raw屬性舔庶,方便后續(xù)分析
>>> adata.raw = adata
鑒定高變基因:
#鑒定高變基因:
>>> sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.5)
#可視化高變基因
>>> sc.pl.highly_variable_genes(adata)
上圖黑色的點(diǎn)是高變基因抛蚁,其他基因是灰色點(diǎn)。
#只留下高變基因進(jìn)行后續(xù)分析
>>> adata = adata[:, adata.var.highly_variable]
>>> adata
View of AnnData object with n_obs × n_vars = 2638 × 1838
obs: 'n_genes', 'percent_mito', 'n_counts'
var: 'gene_ids', 'n_cells', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
uns: 'log1p'
#回歸每個(gè)細(xì)胞的總計(jì)數(shù)和線粒體基因表達(dá)百分比的影響惕橙,將數(shù)據(jù)縮放到單位方差瞧甩。
>>> sc.pp.regress_out(adata, ['n_counts', 'percent_mito'])
#scale數(shù)據(jù)
>>> sc.pp.scale(adata, max_value=10)
降維
PCA降維:
#PCA降維
>>> sc.tl.pca(adata, svd_solver='arpack')
#PCA可視化
>>> sc.pl.pca(adata, color='CST3')
也可以用碎石圖來(lái)決定用多少個(gè)PC來(lái)進(jìn)行臨近細(xì)胞的計(jì)算:
sc.pl.pca_variance_ratio(adata, log=True)
保存結(jié)果:
>>> adata.write(results_file)
>>> adata
AnnData object with n_obs × n_vars = 2638 × 1838
obs: 'n_genes', 'percent_mito', 'n_counts'
var: 'gene_ids', 'n_cells', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
uns: 'log1p', 'pca'
obsm: 'X_pca'
varm: 'PCs'
聚類
聚類前先計(jì)算neighborhood graph,先用默認(rèn)值來(lái)計(jì)算一下:
>>> sc.pp.neighbors(adata, n_neighbors=10, n_pcs=40)
將neighborhood graph嵌入:
>>> sc.tl.umap(adata)
>>> sc.pl.umap(adata, color=['CST3', 'NKG7', 'PPBP'])
如果我將上面的n_pcs=40換成n_pcs=8弥鹦,(這個(gè)值根據(jù)上面的碎石圖來(lái)肚逸,選拐點(diǎn)的數(shù)字)圖就不一樣了:
#neighborhood graph聚類
>>> sc.tl.leiden(adata)
>>> sc.pl.umap(adata, color=['leiden', 'CST3', 'NKG7'])
>>> adata.write("umap.h5ad")
>>> adata
AnnData object with n_obs × n_vars = 2638 × 1838
obs: 'n_genes', 'percent_mito', 'n_counts', 'leiden'
var: 'gene_ids', 'n_cells', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
uns: 'log1p', 'pca', 'neighbors', 'umap', 'leiden', 'leiden_colors'
obsm: 'X_pca', 'X_umap'
varm: 'PCs'
尋找Marker基因
給每一個(gè)cluster計(jì)算top差異基因:
(1)你可以使用t-test方法計(jì)算:
>>> sc.tl.rank_genes_groups(adata, 'leiden', method='t-test')
>>> sc.pl.rank_genes_groups(adata, n_genes=25, sharey=False,fontsize=5)
>>> sc.settings.verbosity = 2 # reduce the verbosity
(2)你也可以用另一種方法來(lái)計(jì)算(推薦):
>>> sc.tl.rank_genes_groups(adata, 'leiden', method='wilcoxon')
>>> sc.pl.rank_genes_groups(adata, n_genes=25, sharey=False,fontsize=5)
然后可以看一下隨便哪個(gè)cluster里的差異基因:
>>> sc.get.rank_genes_groups_df(adata, group="0")
scores names logfoldchanges pvals pvals_adj
0 32.783016 S100A9 7.394584 1.028132e-235 8.518724e-232
1 32.777248 LYZ 6.225223 1.242340e-235 8.518724e-232
2 32.260483 S100A8 7.533852 2.508019e-228 1.146499e-224
3 30.617512 TYROBP 5.391633 7.157181e-206 2.453839e-202
4 29.973402 FCN1 5.507009 2.180708e-197 5.981246e-194
.. ... ... ... ... ...
95 12.487227 LY86 2.596598 8.765488e-36 7.285449e-34
96 12.318965 C20orf24 1.944652 7.160747e-35 5.676444e-33
97 12.233775 CNPY3 2.075463 2.051759e-34 1.617116e-32
98 12.183536 TCEB2 1.488625 3.804261e-34 2.981236e-32
99 12.136163 ASGR1 5.176166 6.793834e-34 5.293786e-32
[100 rows x 5 columns]
查看所有cluster里的top10基因:
>>> pd.DataFrame(adata.uns['rank_genes_groups']['names']).head(10)
0 1 2 3 4 5 ... 7 8 9 10 11 12
0 S100A9 CD74 LDHB RPL32 RPS12 CCL5 ... LST1 IL32 NKG7 NKG7 HLA-DPA1 PF4
1 LYZ CD79A LTB RPS12 RPS6 GZMK ... FCER1G CD3D GNLY CCL5 HLA-DPB1 GNG11
2 S100A8 HLA-DRA CD3D RPS6 RPL13 NKG7 ... COTL1 LDHB GZMB GZMH HLA-DRB1 SDPR
3 TYROBP CD79B IL32 RPS27 RPS3A IL32 ... AIF1 LTB CTSW CST7 HLA-DRA PPBP
4 FCN1 HLA-DPB1 TMEM66 RPS3 RPL3 GZMA ... FTH1 CD3E PRF1 B2M CD74 NRGN
5 FTL MS4A1 IL7R RPS14 RPS3 CTSW ... IFITM3 B2M CST7 HLA-C CST3 SPARC
6 CST3 HLA-DQA1 JUNB RPL13 RPL32 B2M ... SAT1 RPS3 GZMA HLA-A HLA-DRB5 GPX1
7 LGALS2 HLA-DRB1 TPT1 RPL21 RPS14 CST7 ... FTL RPS27 HLA-C GZMA HLA-DQA1 TPM4
8 S100A6 HLA-DQB1 GIMAP7 RPS25 RPS18 HLA-C ... PSAP RPS25 FGFBP2 CD3D HLA-DQB1 RGS18
9 FTH1 CD37 CD3E RPL31 EEF1A1 LYAR ... CTSS HLA-A B2M FGFBP2 FCER1A CALM3
將top10基因和它們的pval同時(shí)展示:
>>> result = adata.uns['rank_genes_groups']
>>> groups = result['names'].dtype.names
>>> pd.DataFrame(
... {group + '_' + key[:1]: result[key][group]
... for group in groups for key in ['names', 'pvals']}).head(10)
0_n 0_p 1_n 1_p ... 11_n 11_p 12_n 12_p
0 S100A9 1.028132e-235 CD74 3.599677e-184 ... HLA-DPA1 1.097649e-19 PF4 4.722886e-10
1 LYZ 1.242340e-235 CD79A 4.507668e-171 ... HLA-DPB1 7.563026e-19 GNG11 4.733899e-10
2 S100A8 2.508019e-228 HLA-DRA 5.186291e-168 ... HLA-DRB1 2.189338e-18 SDPR 4.733899e-10
3 TYROBP 7.157181e-206 CD79B 5.747138e-156 ... HLA-DRA 6.262321e-18 PPBP 4.744938e-10
4 FCN1 2.180708e-197 HLA-DPB1 1.257935e-147 ... CD74 2.286726e-17 NRGN 4.800511e-10
5 FTL 7.485109e-192 MS4A1 1.334052e-140 ... CST3 2.883852e-17 SPARC 4.947990e-10
6 CST3 5.932796e-191 HLA-DQA1 3.534036e-140 ... HLA-DRB5 6.818970e-17 GPX1 4.947990e-10
7 LGALS2 9.551453e-189 HLA-DRB1 1.510260e-130 ... HLA-DQA1 1.022928e-16 TPM4 5.159513e-10
8 S100A6 3.253730e-188 HLA-DQB1 6.020741e-130 ... HLA-DQB1 5.566163e-16 RGS18 5.195614e-10
9 FTH1 4.782225e-180 CD37 6.463741e-130 ... FCER1A 1.482060e-15 CALM3 6.197000e-10
[10 rows x 26 columns]
你還可以單獨(dú)比較某兩個(gè)cluster的差異基因:
#比如cluster 0和1:
>>> sc.tl.rank_genes_groups(adata, 'leiden', groups=['0'], reference='1', method='wilcoxon')
>>> sc.pl.rank_genes_groups(adata, groups=['0'], n_genes=20)
再看一下小提琴圖:
>>> sc.pl.rank_genes_groups_violin(adata, groups=['0'], n_genes=20)
重新load數(shù)據(jù),可以看cluster 0和其他的cluster的比較:
>>> adata=sc.read("./write/pbmc3k.h5ad")
>>> sc.pl.rank_genes_groups_violin(adata, groups='0', n_genes=20)
比較所有cluster里某些特定的基因:
>>> sc.pl.violin(adata, ['CST3', 'NKG7', 'PPBP'], groupby='leiden')
NOTE:下一步將細(xì)胞名稱先存成一個(gè)列表(這一步官網(wǎng)的cluster數(shù)量和我的不一樣,所以就先不做)朦促,把聚類的細(xì)胞注釋:
#你有多少個(gè)cluster就要輸入多少個(gè)名稱
>>> new_cluster_names = [
'CD4 T', 'CD14 Monocytes',
'B', 'CD8 T',
'NK', 'FCGR3A Monocytes',
'Dendritic', 'Megakaryocytes']
>>> adata.rename_categories('leiden', new_cluster_names)
>>> sc.pl.umap(adata, color='leiden', legend_loc='on data', title='', frameon=False, save='.pdf')
可視化每一個(gè)cluster的marker基因:
>>> marker_genes = ['IL7R', 'CD79A', 'MS4A1', 'CD8A', 'CD8B', 'LYZ', 'CD14',
'LGALS3', 'S100A8', 'GNLY', 'NKG7', 'KLRB1',
'FCGR3A', 'MS4A7', 'FCER1A', 'CST3', 'PPBP']
>>> ax = sc.pl.dotplot(adata, marker_genes, groupby='leiden')
上面是點(diǎn)圖犬钢,再畫個(gè)小提琴圖看一下:
>>> ax = sc.pl.stacked_violin(adata, marker_genes, groupby='leiden', rotation=90)
這一部分就告一段落,主要是簡(jiǎn)單的介紹一下scanpy的數(shù)據(jù)預(yù)處理和聚類思灰。我們可以回顧一下adata這個(gè)對(duì)象里現(xiàn)在都存了哪些內(nèi)容:
>>> adata
AnnData object with n_obs × n_vars = 2638 × 1838
obs: 'n_genes', 'percent_mito', 'n_counts', 'leiden'
var: 'gene_ids', 'n_cells', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
uns: 'leiden', 'leiden_colors', 'neighbors', 'pca', 'rank_genes_groups', 'umap'
obsm: 'X_pca', 'X_umap'
varm: 'PCs'
如果想保存adata對(duì)象里某一些內(nèi)容,存成csv格式:
# Export single fields of the annotation of observations
>>> adata.obs[['n_counts', 'louvain_groups']].to_csv(
'./write/pbmc3k_corrected_louvain_groups.csv')
# Export single columns of the multidimensional annotation
>>> adata.obsm.to_df()[['X_pca1', 'X_pca2']].to_csv(
'./write/pbmc3k_corrected_X_pca.csv')
# Or export everything except the data using `.write_csvs`.
# Set `skip_data=False` if you also want to export the data.
>>> adata.write_csvs(results_file[:-5], )