作者:童蒙
編輯:angelica
代碼解讀scanpy又來啦杂穷,不要錯(cuò)過~~今天我們講的是:高可變基因的篩選。
函數(shù)
scanpy.pp.highly_variable_genes
功能
取出高可變基因,默認(rèn)使用log的數(shù)據(jù)耘眨,當(dāng)使用flavor=seurat_v3的時(shí)候鄙币,采用count data。
flavor參數(shù)可以選擇是使用Seurat夯辖,Cell ranger還是seurat v3的算法。
Seurat and Cellranger中董饰,使用的是dispersion-based方法蒿褂,獲得歸一化的方差。先對(duì)基因按照表達(dá)量平均值進(jìn)行分bin卒暂,然后計(jì)算落在每個(gè)bin的基因的離散度(dispersion)的均值和SD啄栓,最終獲得歸一化的dispersion。對(duì)于每個(gè)表達(dá)量的bin也祠,選擇不同的高可變表達(dá)基因昙楚。
而Seurat3的算法,計(jì)算每個(gè)基因的方差進(jìn)行歸一化诈嘿。首先對(duì)數(shù)據(jù)在規(guī)范化標(biāo)準(zhǔn)偏差下(a regularized standard deviation)進(jìn)行標(biāo)準(zhǔn)化(使用z標(biāo)準(zhǔn)化)堪旧,之后計(jì)算每個(gè)基因的歸一化的方差,并且進(jìn)行排序奖亚,獲得高可變基因淳梦。
重要參數(shù)
- adata:輸入的數(shù)據(jù),每行是一個(gè)細(xì)胞昔字,每列是一個(gè)特征
- layer:使用的是哪一個(gè)layer
- n_top_genes:如果是使用seurate_v3的方法爆袍,那么需要指定該參數(shù)。
- min_mean:默認(rèn)0.0125 ;max_mean:默認(rèn)是3 螃宙;min_disp: 默認(rèn)0.5蛮瞄, max_disp: 默認(rèn)是inf。如果指定了n_top_genes , 這個(gè)和其他所有mean和disp參數(shù)都會(huì)無效谆扎,因此設(shè)置了 flavor='seurat_v3' 該參數(shù)無用挂捅。
- span:默認(rèn)是0.3;當(dāng)flavor=seurat_v3的時(shí)候堂湖,用loess模型來估計(jì)variance的數(shù)據(jù)的比例闲先。
- n_bins : 默認(rèn)是20,對(duì)表達(dá)量分bin的數(shù)目无蜂,對(duì)每個(gè)bin里的數(shù)據(jù)進(jìn)行歸一化伺糠,如果只有一個(gè)基因落到bin里,那么該bin的dispersion會(huì)設(shè)為1斥季。
- flavor: {‘seurat’, ‘cell_ranger’, ‘seurat_v3’} (default: 'seurat')
- subset:默認(rèn)是false训桶,只是返回高可變基因,否則就原位替換
- inplace:默認(rèn)是True酣倾,在var中進(jìn)行存儲(chǔ)矩陣
- batch_key:沒看懂
If specified, highly-variable genes are selected within each batch separately and merged. This simple process avoids the selection of batch-specific genes and acts as a lightweight batch correction method. For all flavors, genes are first sorted by how many batches they are a HVG. For dispersion-based flavors ties are broken by normalized dispersion. If flavor = 'seurat_v3', ties are broken by the median (across batches) rank based on within-batch normalized variance.
-
check_values:True舵揭,在seurat_v3模式下有用,檢測每個(gè)count是不是為整型
代碼
## _highly_variable_genes.py
mean, var = materialize_as_ndarray(_get_mean_var(X))
# now actually compute the dispersion
mean[mean == 0] = 1e-12 # set entries equal to zero to small value
dispersion = var / mean
df['dispersions_norm'] = (
df['dispersions'].values # use values here as index differs
- disp_mean_bin[df['mean_bin'].values].values
) / disp_std_bin[df['mean_bin'].values].values
獲得每個(gè)基因的dispersion值躁锡,并進(jìn)行排序
mean, var = _get_mean_var(X_batch)
not_const = var > 0
estimat_var = np.zeros(X.shape[1], dtype=np.float64)
y = np.log10(var[not_const])
x = np.log10(mean[not_const])
model = loess(x, y, span=span, degree=2) ### 對(duì)mean和var進(jìn)行l(wèi)oess回歸
model.fit()
estimat_var[not_const] = model.outputs.fitted_values
reg_std = np.sqrt(10 ** estimat_var)
batch_counts = X_batch.astype(np.float64).copy()
參考資料
https://scanpy.readthedocs.io/en/stable/generated/scanpy.pp.highly_variable_genes.html