1.數(shù)據(jù)和包準(zhǔn)備
著名的pbmc3k數(shù)據(jù):
https://cf.10xgenomics.com/samples/cell/pbmc3k/pbmc3k_filtered_gene_bc_matrices.tar.gz
代碼參考自scanpy的標(biāo)準(zhǔn)流程:
https://scanpy.readthedocs.io/en/stable/tutorials/basics/clustering-2017.html
https://scanpy.readthedocs.io/en/stable/tutorials/basics/clustering.html#manual-cell-type-annotation
import pandas as pd
import scanpy as sc
import numpy as np
sc.settings.verbosity = 3 # verbosity: errors (0), warnings (1), info (2), hints (3)
sc.logging.print_header()
sc.settings.set_figure_params(dpi=80, facecolor="white")
results_file = "write/pbmc3k.h5ad" # the file that will store the analysis results
scanpy==1.10.3 anndata==0.10.9 umap==0.5.6 numpy==1.26.4 scipy==1.11.4 pandas==2.0.3 scikit-learn==1.5.2 statsmodels==0.14.4 igraph==0.11.6 louvain==0.8.2 pynndescent==0.5.13
2.讀取數(shù)據(jù)
10X的輸入數(shù)據(jù)是固定的三個(gè)文件漓滔,在工作目錄下新建01_data/毙籽,把三個(gè)文件放進(jìn)去行瑞。
adata = sc.read_10x_mtx(
"01_data", # the directory with the `.mtx` file
var_names="gene_symbols", # use gene symbols for the variable names (variables-axis index)
cache=True, # write a cache file for faster subsequent reading
)
adata.var_names_make_unique()
# this is unnecessary if using `var_names='gene_ids'` in `sc.read_10x_mtx`
adata
... reading from cache file cache/01_data-matrix.h5ad
AnnData object with n_obs × n_vars = 2700 × 32738
var: 'gene_ids'
#表達(dá)矩陣?yán)锏臄?shù)值范圍
np.min(adata.X),np.max(adata.X)
(0.0, 419.0)
#基本過(guò)濾
#過(guò)濾前
adata.X.shape
(2700, 32738)
sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=3)
filtered out 19024 genes that are detected in less than 3 cells
#過(guò)濾后
adata.X.shape #可以看到過(guò)濾前后的細(xì)胞數(shù)量和基因數(shù)量
(2700, 13714)
查看感興趣的基因的表達(dá)矩陣
稀疏矩陣不支持直接查看,只能是轉(zhuǎn)換成矩陣或者數(shù)據(jù)框才能查看版述。轉(zhuǎn)換成矩陣就丟失了行名列名,轉(zhuǎn)換成數(shù)據(jù)框更好。
adata[0:6, ['CD3D','TCL1A','MS4A1']].X.toarray()
array([[ 4., 0., 0.],
[ 0., 0., 6.],
[10., 0., 0.],
[ 0., 0., 0.],
[ 0., 0., 0.],
[ 1., 0., 0.]], dtype=float32)
adata[0:6, ['CD3D','TCL1A','MS4A1']].to_df()
3.質(zhì)控
這里是對(duì)細(xì)胞進(jìn)行的質(zhì)控悴务,指標(biāo)是:
線(xiàn)粒體基因含量不能過(guò)高;
n_genes_by_counts 不能過(guò)高或過(guò)低
為什么譬猫? n_genes_by_counts是每個(gè)細(xì)胞中檢測(cè)到的基因數(shù)量讯檐。total_counts是細(xì)胞內(nèi)檢測(cè)到的分子總數(shù)。n_genes_by_counts過(guò)低染服,表示該細(xì)胞可能已死/將死或是空液滴别洪。太高的total_counts和/或n_genes_by_counts表明“細(xì)胞”實(shí)際上可以是兩個(gè)或多個(gè)細(xì)胞。結(jié)合線(xiàn)粒體基因count數(shù)除去異常值柳刮,即可除去大多數(shù)雙峰/死細(xì)胞/空液滴挖垛,因此它們過(guò)濾是常見(jiàn)的預(yù)處理步驟痒钝。 參考自:https://www.biostars.org/p/407036/
3.1 查看三個(gè)指標(biāo)
在anndata對(duì)象中,細(xì)胞的注釋是在adata.var里痢毒,基因的注釋是在adata.obs里送矩。
adata.var.head()
adata.obs.head()
此時(shí)的表格里僅有基本過(guò)濾時(shí)計(jì)算的指標(biāo)n_genes和n_cells,還沒(méi)有我們需要的過(guò)濾指標(biāo)("n_genes_by_counts", "total_counts", "pct_counts_mt")哪替。需要用calculate_qc_metrics來(lái)計(jì)算栋荸。
#質(zhì)控
# annotate the group of mitochondrial genes as "mt"
adata.var["mt"] = adata.var_names.str.startswith("MT-")
sc.pp.calculate_qc_metrics(
adata, qc_vars=["mt"], percent_top=None, log1p=False, inplace=True
)
adata.obs.head()
可以看到,表格里有n_genes和n_genes_by_counts凭舶,好奇二者之間的關(guān)聯(lián)晌块,統(tǒng)計(jì)了一下這兩列數(shù)值相等的個(gè)數(shù),發(fā)現(xiàn)更多的是不相等帅霜。
(adata.obs.n_genes == adata.obs.n_genes_by_counts).value_counts()
False 1924
True 776
Name: count, dtype: int64
搜索得知摸袁,這里的n_genes是剛開(kāi)始基本過(guò)濾時(shí)計(jì)算的,n_genes_by_counts是calculate_qc_metrics計(jì)算的义屏,二者有一點(diǎn)區(qū)別靠汁,后續(xù)質(zhì)控用的是n_genes_by_counts。
有點(diǎn)麻煩闽铐,如果是R語(yǔ)言的seurat就只會(huì)計(jì)算一個(gè)n_Feature蝶怔,對(duì)應(yīng)這里的n_genes_by_counts。所以我們就忽略n_genes吧兄墅。
https://discourse.scverse.org/t/difference-between-n-genes-and-n-genes-by-counts/632
sc.pl.violin(
adata,
["n_genes_by_counts", "total_counts", "pct_counts_mt"],
jitter=0.4,
# **{"color": "#f8766d"}, # 使用額外的參數(shù)來(lái)設(shè)置顏色
multi_panel=True
)
根據(jù)這個(gè)三個(gè)圖踢星,確定了這個(gè)數(shù)據(jù)的過(guò)濾標(biāo)準(zhǔn):
nFeature_RNA在200~2500之間;線(xiàn)粒體基因占比在5%以下隙咸。
3.2 三個(gè)指標(biāo)之間的相關(guān)性
sc.pl.scatter(adata, x="total_counts", y="pct_counts_mt")
sc.pl.scatter(adata, x="total_counts", y="n_genes_by_counts")
3.3 過(guò)濾
adata = adata[adata.obs.n_genes_by_counts < 2500, :]
adata = adata[adata.obs.pct_counts_mt < 5, :].copy()
鏈?zhǔn)劫x值:如果你在一段代碼中連續(xù)對(duì)數(shù)據(jù)進(jìn)行多次篩選或其他操作沐悦,可以在最后一步使用 .copy() 來(lái)確保最終結(jié)果是一個(gè)獨(dú)立的副本。這樣可以避免在中間步驟中不小心修改原始數(shù)據(jù)五督。
adata.shape
(2638, 13714)
過(guò)濾之后少了一些基因藏否。
4.找高變基因(HVG)
首先將數(shù)據(jù)矩陣歸一化(校正文庫(kù)大小)為每個(gè)單元格10000條reads充包,以使細(xì)胞之間的counts具有可比性副签。
sc.pp.normalize_total(adata, target_sum=1e4)
normalizing counts per cell
finished (0:00:00)
sc.pp.log1p(adata) #對(duì)數(shù)據(jù)進(jìn)行l(wèi)og
查看感興趣的基因的數(shù)值大小,可見(jiàn)與先前的count不同
adata[0:6, ['ACTB','PPBP','MS4A1']].to_df()
sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.5)
extracting highly variable genes
finished (0:00:00)
--> added
'highly_variable', boolean vector (adata.var)
'means', float vector (adata.var)
'dispersions', float vector (adata.var)
'dispersions_norm', float vector (adata.var)
adata.var.head()
離散度是基因表達(dá)方差與基因表達(dá)平均水平的比值基矮,用于評(píng)估基因表達(dá)的變異程度淆储。
sc.pl.highly_variable_genes(adata)
#sc.pl.highly_variable_genes(adata, log=True)
查看前10個(gè)高變化基因(因?yàn)閟canpy和seurat的高變化基因默認(rèn)參數(shù)不同,所以找出的基因也不相同)
#adata.var.loc[adata.var.highly_variable == True]
he = adata.var.sort_values(by='dispersions_norm', ascending=False)
print(he[he.highly_variable == True].index.tolist()[0:10])
['DOK3', 'ARVCF', 'YPEL2', 'UBE2D4', 'FAM210B', 'CTB-113I20.2', 'GBGT1', 'LRRIQ3', 'MTIF2', 'TTC8']
adata.raw = adata
存儲(chǔ)一下adata的原始數(shù)據(jù)
adata = adata[:, adata.var.highly_variable] #非必須
5.標(biāo)準(zhǔn)化和降維
5.1 線(xiàn)性降維PCA
sc.pp.regress_out(adata, ["total_counts", "pct_counts_mt"]) #回歸
sc.pp.scale(adata, max_value=10) # scale
regressing out ['total_counts', 'pct_counts_mt']
sparse input is densified and may lead to high memory use
finished (0:00:22)
sc.tl.pca(adata, svd_solver="arpack")
computing PCA
with n_comps=50
finished (0:00:01)
sc.pl.pca(adata)
sc.pl.pca_variance_ratio(adata, log=True)
adata
AnnData object with n_obs × n_vars = 2638 × 13714
obs: 'n_genes', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt'
var: 'gene_ids', 'n_cells', 'mt', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'mean', 'std'
uns: 'log1p', 'hvg', 'pca'
obsm: 'X_pca'
varm: 'PCs'
sc.pp.neighbors(adata, n_neighbors=10, n_pcs=40)
computing neighbors
using 'X_pca' with n_pcs = 40
finished: added to `.uns['neighbors']`
`.obsp['distances']`, distances for each pair of neighbors
`.obsp['connectivities']`, weighted adjacency matrix (0:00:00)
5.2 非線(xiàn)性降維UMAP
sc.tl.umap(adata)
computing UMAP
finished: added
'X_umap', UMAP coordinates (adata.obsm)
'umap', UMAP parameters (adata.uns) (0:00:02)
leiden是聚類(lèi)家浇,沒(méi)有設(shè)置分辨率參數(shù)本砰,默認(rèn)值是1
# Using the igraph implementation and a fixed number of iterations can be significantly faster, especially for larger datasets
#sc.tl.leiden(adata)
sc.tl.leiden(adata,flavor="igraph",n_iterations=2)
sc.pl.umap(adata, color=["leiden"],
legend_loc="on data",
size=5)
running Leiden clustering
finished: found 8 clusters and added
'leiden', the cluster labels (adata.obs, categorical) (0:00:00)
6.marker基因及其可視化
啥叫marker基因呢。和差異基因里面的上調(diào)基因有點(diǎn)類(lèi)似钢悲,某個(gè)基因在某一簇細(xì)胞里表達(dá)量都很高点额,在其他簇表達(dá)量很低青团,那么這個(gè)基因就是這簇細(xì)胞的象征。
6.1 計(jì)算全部cluster的maker基因
# Obtain cluster-specific differentially expressed genes
sc.tl.rank_genes_groups(adata, groupby="leiden", method="wilcoxon",pts=True)
sc.pl.rank_genes_groups_dotplot(
adata, groupby="leiden", standard_scale="var", n_genes=5
)
ranking genes
finished: added to `.uns['rank_genes_groups']`
'names', sorted np.recarray to be indexed by group ids
'scores', sorted np.recarray to be indexed by group ids
'logfoldchanges', sorted np.recarray to be indexed by group ids
'pvals', sorted np.recarray to be indexed by group ids
'pvals_adj', sorted np.recarray to be indexed by group ids (0:00:02)
WARNING: dendrogram data not found (using key=dendrogram_leiden). Running `sc.tl.dendrogram` with default parameters. For fine tuning it is recommended to run `sc.tl.dendrogram` independently.
using 'X_pca' with n_pcs = 50
Storing dendrogram info using `.uns['dendrogram_leiden']`
marker基因表格
pd.DataFrame(adata.uns["rank_genes_groups"]["names"]).head(10)
如果我們想要查看計(jì)算結(jié)果咖楣,像FindAllMarkers那樣的數(shù)據(jù)框督笆,就需要自己整理一番,這段是我自己加的诱贿,好費(fèi)勁哦娃肿。
result = adata.uns["rank_genes_groups"]
result.keys() #看看結(jié)果都包含哪些部分
dict_keys(['params', 'pts', 'pts_rest', 'names', 'scores', 'pvals', 'pvals_adj', 'logfoldchanges'])
取出需要的信息,來(lái)做后面的整理珠十。注意:觀(guān)察發(fā)現(xiàn)pts和pts_rest的行名順序與其他列不同料扰,所以要分開(kāi)整理,避免出錯(cuò)焙蹭。
keys_to_get = ("pvals","logfoldchanges","pvals_adj","names")
subset = {k: pd.DataFrame(result[k]) for k in keys_to_get if k in result}
{key: value.shape for key, value in subset.items()}
{'pvals': (13714, 8),
'logfoldchanges': (13714, 8),
'pvals_adj': (13714, 8),
'names': (13714, 8)}
{key: value.iloc[1:4,1:4,] for key, value in subset.items()}
{'pvals': 1 2 3
1 1.679730e-170 5.878996e-100 1.513503e-88
2 6.935111e-167 2.002411e-98 1.405777e-84
3 2.569135e-154 3.853307e-97 7.510027e-84,
'logfoldchanges': 1 2 3
1 7.749746 4.805129 7.957393
2 4.887034 4.863548 7.636600
3 5.518004 3.643959 4.836388,
'pvals_adj': 1 2 3
1 1.151791e-166 4.031227e-96 1.037809e-84
2 3.170270e-163 9.153686e-95 6.426275e-81
3 8.808280e-151 1.321106e-93 2.574813e-80,
'names': 1 2 3
1 CD79A FCER1G GNLY
2 HLA-DRA AIF1 GZMB
3 CD79B COTL1 CTSW}
官網(wǎng)給的代碼↓晒杈,整理出的格式是每簇占4列,且每簇里的基因排列順序都不同孔厉,不方便進(jìn)行篩選拯钻,所以要繼續(xù)整理成規(guī)范的數(shù)據(jù)框
groups = result["names"].dtype.names
pbmc_markers = pd.DataFrame(
{
group + "_" + key: result[key][group]
for group in groups
for key in [ "pvals","logfoldchanges","pvals_adj","names"]
}
)
pbmc_markers.shape
pbmc_markers.to_csv('pbmc_markers.csv', index=False)
pbmc_markers.head(5)
我自己加的調(diào)整格式的代碼如下:
import pandas as pd
import numpy as np
df = pbmc_markers
n = 4
split_df = []
for start_col in range(0, df.shape[1], n):
end_col = min(start_col + n, df.shape[1])
a = df.iloc[:, start_col:end_col]
a['cluster'] = a.columns.str.split('_').str[0].unique().tolist()* a.shape[0]
a.columns = [col.split("_", 1)[1] for col in a.columns[:n]] + a.columns[n:].tolist()
split_df.append(a)
new_df = pd.concat(split_df, ignore_index=True)
new_df.head()
然后再將每個(gè)基因的細(xì)胞表達(dá)比例加上去
keys_to_get = ('pts', 'pts_rest')
subset = {k: pd.DataFrame(result[k]) for k in keys_to_get if k in result}
subset.keys()
dict_keys(['pts', 'pts_rest'])
pt1 = pd.melt(subset['pts'].reset_index().rename(columns={'index': 'names'}),
id_vars='names', var_name='cluster', value_name='pts')
pt1.head()
pt1.names.equals(pt2.names)
True
pt1["pts_rest"] = pt2.pts_rest
pt1.head()
new_df = pd.merge(new_df, pt1, on=['names', 'cluster'])
new_df.head()
new_df = new_df.loc[((new_df.pts>0.25)|(new_df.pts_rest>0.25))&(new_df.logfoldchanges>0.25) ,]
new_df.shape
(4285, 7)
6.2 marker基因可視化
df = pd.read_table("markers.txt",header=None)
df
df.columns = ['Cell_Type', 'Marker']
# 創(chuàng)建字典,其中每個(gè)細(xì)胞類(lèi)型對(duì)應(yīng)一個(gè)標(biāo)記基因列表
markers = df.groupby('Cell_Type')['Marker'].apply(list).to_dict()
markers
sc.pl.dotplot(adata, markers, groupby='leiden',figsize=(15,4),var_group_rotation=0)
sc.pl.violin(adata,['PPBP','MS4A1','CD8A'],'leiden')
sc.pl.stacked_violin(adata, markers, groupby = "leiden",figsize= (16,7),var_group_rotation=0)
sc.pl.heatmap(adata, markers, groupby='leiden', var_group_rotation=True,figsize=(12,6))
sc.pl.matrixplot(adata, markers, groupby='leiden',figsize=(12,2),var_group_rotation= 0)
sc.pl.tracksplot(adata,markers,'leiden',figsize=(15,4))
7.細(xì)胞類(lèi)型注釋
new_cluster_names = {
'0': "CD4 T",
'1': "B",
'2': "FCGR3A+ Mono",
'3': "NK",
'4': "CD8 T",
'5': "CD14+ Mono",
'6': "DC",
'7': "Platelet"
}
#adata.rename_categories("louvain", new_cluster_names)
adata.obs['leiden'] = adata.obs['leiden'].map(new_cluster_names)
adata.obs['leiden']
AAACATACAACCAC-1 CD4 T
AAACATTGAGCTAC-1 B
AAACATTGATCAGC-1 CD4 T
AAACCGTGCTTCCG-1 FCGR3A+ Mono
AAACCGTGTATGCG-1 NK
...
TTTCGAACTCTCAT-1 CD14+ Mono
TTTCTACTGAGGCA-1 B
TTTCTACTTCCTCG-1 B
TTTGCATGAGAGGC-1 B
TTTGCATGCCTCAC-1 CD4 T
Name: leiden, Length: 2638, dtype: category
Categories (8, object): ['CD4 T', 'B', 'FCGR3A+ Mono', 'NK', 'CD8 T', 'CD14+ Mono', 'DC', 'Platelet']
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(7, 7))
# 使用 scanpy.pl.umap() 繪制 UMAP 圖撰豺,并將圖形繪制到指定的 Axes 對(duì)象上
sc.pl.umap(adata, color='leiden', legend_loc='on data',
frameon=False, legend_fontsize=10, legend_fontoutline=2,
add_outline=True, palette="Set1", ax=ax)
# 顯示圖表
plt.show()
import pandas as pd
umap_coords = adata.obsm['X_umap']
labels = adata.obs['leiden']
plt.figure(figsize=(5,5))
unique_labels = labels.unique()
colors = plt.cm.get_cmap('Set1', len(unique_labels))
for i, label in enumerate(unique_labels):
plt.scatter(umap_coords[labels == label, 0], umap_coords[labels == label, 1],
color=colors(i), alpha=0.5)
for label in unique_labels:
label_coords = umap_coords[labels == label]
center_x = label_coords[:, 0].mean()
center_y = label_coords[:, 1].mean()
plt.text(center_x, center_y, label, fontsize=12, ha='center', va='center')
plt.grid(False)
plt.title('UMAP Plot with Cluster Labels', fontsize=14)
plt.xlabel('UMAP 1', fontsize=12)
plt.ylabel('UMAP 2', fontsize=12)
plt.show()
!mkdir write
adata.write(results_file)