Author: ahworld
Link: Single-cell data QC - doublet prediction - a scrublet tutorial
Source: WeChat public account - seqyuan
The copyright belongs to the author. Please contact the author before reposting in any form.
Before analyzing scRNA-seq data, we must make sure that all cell barcodes correspond to viable cells. Cell QC (quality control) is usually performed on three QC covariates:
- Count depth per barcode
- Number of genes detected per barcode
- Fraction of counts from mitochondrial genes per barcode
By inspecting the distributions of these QC covariates, outlier peaks can be filtered out with thresholds.
These outlier barcodes may correspond to:
- Dying cells
- Cells with broken membranes
- Doublets
For example, a barcode with low count depth, few detected genes and a high fraction of mitochondrial counts suggests a cell whose cytoplasmic mRNA has leaked out through a broken membrane, so that only the mRNA located in mitochondria is still inside the cell. Conversely, a cell with unexpectedly high counts and a large number of detected genes may represent a doublet.
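These three covariates can be computed directly from a counts matrix. Below is a minimal sketch, assuming a cells x genes scipy sparse counts matrix and a matching gene-name array (like the counts_matrix and genes objects loaded later in this tutorial), with human mitochondrial genes identified by the 'MT-' prefix:
import numpy as np
# counts_matrix: scipy sparse matrix with cells as rows and genes as columns
# genes: numpy array of gene names matching the matrix columns
counts_per_barcode = np.asarray(counts_matrix.sum(axis=1)).ravel()       # count depth per barcode
genes_per_barcode = np.asarray((counts_matrix > 0).sum(axis=1)).ravel()  # detected genes per barcode
mito_mask = np.char.startswith(genes.astype(str), 'MT-')                 # mitochondrial genes (human naming)
mito_fraction = np.asarray(counts_matrix[:, mito_mask].sum(axis=1)).ravel() / counts_per_barcode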
Commonly used tools for identifying doublets in scRNA-seq data include:
- scrublet (python)
- DoubletDetection (python)
- DoubletDecon (R)
- DoubletFinder (R)
These doublet detection tools are also recommended in the single-cell analysis best-practices paper published in 2019 (Luecken M D et al., 2019).
This post gives a detailed walk-through of how to use scrublet.
Using scrublet
scrublet paper: Wolock S L, Lopez R, Klein A M. Scrublet: computational identification of cell doublets in single-cell transcriptomic data[J]. Cell Systems, 2019, 8(4): 281-291.e9.
Reference link for the scrublet tutorial this post follows.
Installation
scrublet is a Python package; it can be installed with the following pip command.
pip install scrublet
Usage notes
- When processing data from multiple samples, run Scrublet on each sample separately. Scrublet is designed to detect technical doublets formed by the random co-encapsulation of two cells, so it may perform poorly on merged datasets, where the cell type proportions do not represent any single sample (see the sketch after this list).
- Check that the doublet score threshold is reasonable and adjust it manually if necessary. The doublet score histogram is not always cleanly bimodal.
- In UMAP or t-SNE visualizations, the predicted doublets should largely co-localize (possibly in several clusters). If they do not, you may need to adjust the doublet score threshold, or change the preprocessing parameters to better resolve the cell states present in the data.
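For the first point, a minimal sketch of running Scrublet on each sample separately is shown below; sample_matrices is a hypothetical dict mapping each sample name to its own cells x genes sparse counts matrix:
import scrublet as scr
# sample_matrices: {sample_name: cells x genes sparse counts matrix} -- hypothetical container
per_sample_results = {}
for sample, counts in sample_matrices.items():
    scrub = scr.Scrublet(counts, expected_doublet_rate=0.06)
    doublet_scores, predicted_doublets = scrub.scrub_doublets()
    per_sample_results[sample] = (doublet_scores, predicted_doublets)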
Preparation
Data
Download the 8k PBMC dataset from 10X Genomics and extract it.
wget http://cf.10xgenomics.com/samples/cell-exp/2.1.0/pbmc8k/pbmc8k_filtered_gene_bc_matrices.tar.gz
tar xfz pbmc8k_filtered_gene_bc_matrices.tar.gz
Fixing a numpy compatibility error
How to fix the cannot import name '_validate_lengths' error caused by newer numpy versions
The scrublet package was developed against an older version of numpy and uses the _validate_lengths function from arraypad.py. This function is no longer present in recent numpy versions, so if a newer numpy is installed, running scrublet may abort with the error: cannot import name '_validate_lengths'.
Since many other packages depend on a recent numpy, downgrading numpy is undesirable; the workaround that lets scrublet keep running is as follows.
Open a terminal, start Python, and enter the following code to see where your Python packages are installed.
import sys
print(sys.path)
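A quicker alternative (not in the original post) is to ask numpy for the file path directly; this assumes numpy.lib.arraypad is importable under that name in your numpy version:
import numpy.lib.arraypad
print(numpy.lib.arraypad.__file__)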
Locate arraypad.py, e.g. at ~/anaconda3/lib/python3.6/site-packages/numpy/lib/arraypad.py, open the file, append the following code at the end of the file, save and exit, and the problem is solved.
def _normalize_shape(ndarray, shape, cast_to_int=True):
    """
    Private function which does some checks and normalizes the possibly
    much simpler representations of 'pad_width', 'stat_length',
    'constant_values', 'end_values'.

    Parameters
    ----------
    narray : ndarray
        Input ndarray
    shape : {sequence, array_like, float, int}, optional
        The width of padding (pad_width), the number of elements on the
        edge of the narray used for statistics (stat_length), the constant
        value(s) to use when filling padded regions (constant_values), or the
        endpoint target(s) for linear ramps (end_values).
        ((before_1, after_1), ... (before_N, after_N)) unique number of
        elements for each axis where `N` is rank of `narray`.
        ((before, after),) yields same before and after constants for each
        axis.
        (constant,) or val is a shortcut for before = after = constant for
        all axes.
    cast_to_int : bool, optional
        Controls if values in ``shape`` will be rounded and cast to int
        before being returned.

    Returns
    -------
    normalized_shape : tuple of tuples
        val => ((val, val), (val, val), ...)
        [[val1, val2], [val3, val4], ...] => ((val1, val2), (val3, val4), ...)
        ((val1, val2), (val3, val4), ...) => no change
        [[val1, val2], ] => ((val1, val2), (val1, val2), ...)
        ((val1, val2), ) => ((val1, val2), (val1, val2), ...)
        [[val , ], ] => ((val, val), (val, val), ...)
        ((val , ), ) => ((val, val), (val, val), ...)
    """
    ndims = ndarray.ndim

    # Shortcut shape=None
    if shape is None:
        return ((None, None), ) * ndims

    # Convert any input `info` to a NumPy array
    shape_arr = np.asarray(shape)

    try:
        shape_arr = np.broadcast_to(shape_arr, (ndims, 2))
    except ValueError:
        fmt = "Unable to create correctly shaped tuple from %s"
        raise ValueError(fmt % (shape,))

    # Cast if necessary
    if cast_to_int is True:
        shape_arr = np.round(shape_arr).astype(int)

    # Convert list of lists to tuple of tuples
    return tuple(tuple(axis) for axis in shape_arr.tolist())


def _validate_lengths(narray, number_elements):
    """
    Private function which does some checks and reformats pad_width and
    stat_length using _normalize_shape.

    Parameters
    ----------
    narray : ndarray
        Input ndarray
    number_elements : {sequence, int}, optional
        The width of padding (pad_width) or the number of elements on the edge
        of the narray used for statistics (stat_length).
        ((before_1, after_1), ... (before_N, after_N)) unique number of
        elements for each axis.
        ((before, after),) yields same before and after constants for each
        axis.
        (constant,) or int is a shortcut for before = after = constant for all
        axes.

    Returns
    -------
    _validate_lengths : tuple of tuples
        int => ((int, int), (int, int), ...)
        [[int1, int2], [int3, int4], ...] => ((int1, int2), (int3, int4), ...)
        ((int1, int2), (int3, int4), ...) => no change
        [[int1, int2], ] => ((int1, int2), (int1, int2), ...)
        ((int1, int2), ) => ((int1, int2), (int1, int2), ...)
        [[int , ], ] => ((int, int), (int, int), ...)
        ((int , ), ) => ((int, int), (int, int), ...)
    """
    normshp = _normalize_shape(narray, number_elements)
    for i in normshp:
        chk = [1 if x is None else x for x in i]
        chk = [1 if x >= 0 else -1 for x in chk]
        if (chk[0] < 0) or (chk[1] < 0):
            fmt = "%s cannot contain negative values."
            raise ValueError(fmt % (number_elements,))
    return normshp
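After saving the file and restarting Python, a quick sanity check is to confirm that the missing name can now be imported:
from numpy.lib.arraypad import _validate_lengths
print('patch applied: _validate_lengths is importable')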
scrublet walk-through
My test environment is macOS with jupyter notebook. The following code loads the Python packages and configures plotting:
%matplotlib inline
import scrublet as scr
import scipy.io
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
plt.rcParams['font.family'] = 'sans-serif'
plt.rcParams['font.sans-serif'] = 'Arial'
plt.rc('font', size=14)
plt.rcParams['pdf.fonttype'] = 42
Read in the 10X scRNA-seq matrix. The raw counts are loaded as a scipy sparse matrix with cells as rows and genes as columns:
input_dir = '/Users/yuanzan/Desktop/doublets/filtered_gene_bc_matrices/GRCh38/'
counts_matrix = scipy.io.mmread(input_dir + '/matrix.mtx').T.tocsc()
genes = np.array(scr.load_genes(input_dir + '/genes.tsv', delimiter='\t', column=1))
out_df = pd.read_csv(input_dir + '/barcodes.tsv', header = None, index_col=None, names=['barcode'])
print('Counts matrix shape: {} rows, {} columns'.format(counts_matrix.shape[0], counts_matrix.shape[1]))
print('Number of genes in gene list: {}'.format(len(genes)))
Counts matrix shape: 8381 rows, 33694 columns
Number of genes in gene list: 33694
Initialize the Scrublet object
The relevant parameters are:
- expected_doublet_rate: the expected fraction of doublets, usually 0.05-0.1. The results are not particularly sensitive to this parameter. For this example dataset, the expected doublet rate is taken from the Chromium User Guide.
- sim_doublet_ratio: the number of doublets to simulate, relative to the number of observed transcriptomes. This should be high enough that all doublet states are well represented by the simulated doublets. Setting it too high increases the computation time; the default is 2 (although values as low as 0.5 give very similar results on the datasets tested).
- n_neighbors: the number of neighbors used to build the KNN classifier over the observed transcriptomes and simulated doublets. The default, round(0.5 * sqrt(n_cells)), usually works well.
scrub = scr.Scrublet(counts_matrix, expected_doublet_rate=0.06)
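For reference, all three parameters described above can also be set explicitly; the call below simply restates the defaults discussed in the list (a sketch, not required for this dataset):
n_cells = counts_matrix.shape[0]
scrub = scr.Scrublet(counts_matrix,
                     expected_doublet_rate=0.06,
                     sim_doublet_ratio=2.0,
                     n_neighbors=int(round(0.5 * np.sqrt(n_cells))))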
Compute doublet scores
Run the code below to compute doublet scores. Internally, this performs:
- Doublet simulation
- Normalization, gene filtering, rescaling, PCA
- Doublet score calculation
- Doublet score threshold detection and doublet calling
doublet_scores, predicted_doublets = scrub.scrub_doublets(min_counts=2, min_cells=3, min_gene_variability_pctl=85, n_prin_comps=30)
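Besides the two arrays it returns, the Scrublet object also stores the intermediate results as attributes; the names below are taken from the scrublet source and may differ between versions:
# observed-cell scores, simulated-doublet scores and the automatically detected threshold
print(scrub.doublet_scores_obs_[:5])
print(scrub.doublet_scores_sim_[:5])
print(scrub.threshold_)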
Plot the doublet score histograms
The doublet score histograms cover both the observed transcriptomes and the simulated doublets; the simulated doublet histogram is typically bimodal.
- The left mode corresponds to "embedded" doublets, generated by two cells with similar gene expression profiles.
- The right mode corresponds to "neotypic" doublets, generated by cells with distinct gene expression (e.g., different cell types); these introduce more artifacts in downstream analyses. Scrublet can only detect "neotypic" doublets.
To call doublets vs. singlets, we must set a doublet score threshold, ideally at the minimum between the two modes of the simulated doublet histogram. The scrub_doublets() function tries to identify this point automatically, and it works well on this example dataset. If automatic threshold detection does not perform well, the threshold can be adjusted with the call_doublets() function, for example:
scrub.call_doublets(threshold=0.25)
# plot the doublet score histograms
scrub.plot_histogram()
Dimensionality reduction and visualization
Compute the embedding
This example uses UMAP for dimensionality reduction; t-SNE is also available, but the author does not recommend it because it runs considerably slower.
print('Running UMAP...')
scrub.set_embedding('UMAP', scr.get_umap(scrub.manifold_obs_, 10, min_dist=0.3))
print('Done.')
UMAP visualization
scrub.plot_embedding('UMAP', order_points=True)
In the left panel of the resulting plot, the black points are the predicted doublets.
# fraction of barcodes predicted as doublets
print (scrub.detected_doublet_rate_)
# 0.043789523923159525
Save the doublet predictions to a file; when the data are later processed with Seurat or other tools, these predictions can be imported to filter barcodes.
out_df['doublet_scores'] = doublet_scores
out_df['predicted_doublets'] = predicted_doublets
out_df.to_csv(input_dir + '/doublet.txt', index=False,header=True)
out_df.head()
barcode | doublet_scores | predicted_doublets |
---|---|---|
AAACCTGAGCATCATC-1 | 0.020232985898221900 | FALSE |
AAACCTGAGCTAACTC-1 | 0.009746972531259230 | FALSE |
AAACCTGAGCTAGTGG-1 | 0.013493253373313300 | FALSE |
AAACCTGCACATTAGC-1 | 0.087378640776699 | FALSE |
AAACCTGCACTGTTAG-1 | 0.02405046655276650 | FALSE |
AAACCTGCATAGTAAG-1 | 0.03969184391224250 | FALSE |
AAACCTGCATGAACCT-1 | 0.030082836796977200 | FALSE |
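If you continue in Python instead of Seurat, the predictions can also be used directly to drop the predicted doublets from the counts matrix (a minimal sketch using the objects created above):
singlet_mask = ~np.asarray(predicted_doublets, dtype=bool)  # keep barcodes predicted as singlets
filtered_counts = counts_matrix[singlet_mask, :]
filtered_barcodes = out_df.loc[singlet_mask, 'barcode']
print('Cells kept after doublet removal: {}'.format(filtered_counts.shape[0]))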
The jupyter notebook for this test has been uploaded to my github (seqyuan); feel free to download it and try it yourself.