Author: ahworld
Link: Single-cell data QC - doublet prediction - a scrublet tutorial
Source: WeChat public account - seqyuan
The copyright belongs to the author. Please contact the author before reposting in any form.
Before analyzing scRNA-seq data, we must make sure that all cell barcodes correspond to viable cells. Cell QC (quality control) is usually performed on three QC covariates:
- Count depth per barcode
- Number of genes detected per barcode
- Fraction of counts from mitochondrial genes per barcode
By inspecting the distributions of these QC covariates, outlier peaks can be filtered out with thresholds.
These outlier barcodes may correspond to:
- Dying cells
- Cells with broken membranes
- Doublets
For example, a barcode with low count depth, few detected genes and a high fraction of mitochondrial counts suggests a cell whose cytoplasmic mRNA has leaked out through a broken membrane, so that only the mRNA located in mitochondria is still inside the cell. Conversely, a cell with unexpectedly high counts and a large number of detected genes may represent a doublet.
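These three covariates can be computed directly from a counts matrix. Below is a minimal sketch, assuming a cells x genes scipy sparse counts matrix and a matching gene-name array (like the counts_matrix and genes objects loaded later in this tutorial), with human mitochondrial genes identified by the 'MT-' prefix:
import numpy as np
# counts_matrix: scipy sparse matrix with cells as rows and genes as columns
# genes: numpy array of gene names matching the matrix columns
counts_per_barcode = np.asarray(counts_matrix.sum(axis=1)).ravel()       # count depth per barcode
genes_per_barcode = np.asarray((counts_matrix > 0).sum(axis=1)).ravel()  # detected genes per barcode
mito_mask = np.char.startswith(genes.astype(str), 'MT-')                 # mitochondrial genes (human naming)
mito_fraction = np.asarray(counts_matrix[:, mito_mask].sum(axis=1)).ravel() / counts_per_barcode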
Commonly used tools for identifying doublets in scRNA-seq data include:
- scrublet (python)
- DoubletDetection (python)
- DoubletDecon (R)
- DoubletFinder (R)
These doublet detection tools are also recommended in the single-cell analysis best-practices paper published in 2019 (Luecken M D et al., 2019).
This post gives a detailed walk-through of how to use scrublet.
Using scrublet
scrublet paper: Wolock S L, Lopez R, Klein A M. Scrublet: computational identification of cell doublets in single-cell transcriptomic data[J]. Cell Systems, 2019, 8(4): 281-291.e9.
Reference link for the scrublet tutorial this post follows.
Installation
scrublet is a Python package; it can be installed with the following pip command.
pip install scrublet
Usage notes
- When processing data from multiple samples, run Scrublet on each sample separately. Scrublet is designed to detect technical doublets formed by the random co-encapsulation of two cells, so it may perform poorly on merged datasets, where the cell type proportions do not represent any single sample (see the sketch after this list).
- Check that the doublet score threshold is reasonable and adjust it manually if necessary. The doublet score histogram is not always cleanly bimodal.
- In UMAP or t-SNE visualizations, the predicted doublets should largely co-localize (possibly in several clusters). If they do not, you may need to adjust the doublet score threshold, or change the preprocessing parameters to better resolve the cell states present in the data.
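For the first point, a minimal sketch of running Scrublet on each sample separately is shown below; sample_matrices is a hypothetical dict mapping each sample name to its own cells x genes sparse counts matrix:
import scrublet as scr
# sample_matrices: {sample_name: cells x genes sparse counts matrix} -- hypothetical container
per_sample_results = {}
for sample, counts in sample_matrices.items():
    scrub = scr.Scrublet(counts, expected_doublet_rate=0.06)
    doublet_scores, predicted_doublets = scrub.scrub_doublets()
    per_sample_results[sample] = (doublet_scores, predicted_doublets)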
Preparation
Data
Download the 8k PBMC dataset from 10X Genomics and extract it.
wget http://cf.10xgenomics.com/samples/cell-exp/2.1.0/pbmc8k/pbmc8k_filtered_gene_bc_matrices.tar.gz
tar xfz pbmc8k_filtered_gene_bc_matrices.tar.gz
Fixing a numpy compatibility error
How to fix the cannot import name '_validate_lengths' error caused by newer numpy versions
The scrublet package was developed against an older version of numpy and uses the _validate_lengths function from arraypad.py. This function is no longer present in recent numpy versions, so if a newer numpy is installed, running scrublet may abort with the error: cannot import name '_validate_lengths'.
Since many other packages depend on a recent numpy, downgrading numpy is undesirable; the workaround that lets scrublet keep running is as follows.
Open a terminal, start Python, and enter the following code to see where your Python packages are installed.
import sys
print(sys.path)
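A quicker alternative (not in the original post) is to ask numpy for the file path directly; this assumes numpy.lib.arraypad is importable under that name in your numpy version:
import numpy.lib.arraypad
print(numpy.lib.arraypad.__file__)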
Locate arraypad.py, e.g. at ~/anaconda3/lib/python3.6/site-packages/numpy/lib/arraypad.py, open the file, append the following code at the end of the file, save and exit, and the problem is solved.
def _normalize_shape(ndarray, shape, cast_to_int=True):
    """
    Private function which does some checks and normalizes the possibly
    much simpler representations of 'pad_width', 'stat_length',
    'constant_values', 'end_values'.

    Parameters
    ----------
    narray : ndarray
        Input ndarray
    shape : {sequence, array_like, float, int}, optional
        The width of padding (pad_width), the number of elements on the
        edge of the narray used for statistics (stat_length), the constant
        value(s) to use when filling padded regions (constant_values), or the
        endpoint target(s) for linear ramps (end_values).
        ((before_1, after_1), ... (before_N, after_N)) unique number of
        elements for each axis where `N` is rank of `narray`.
        ((before, after),) yields same before and after constants for each
        axis.
        (constant,) or val is a shortcut for before = after = constant for
        all axes.
    cast_to_int : bool, optional
        Controls if values in ``shape`` will be rounded and cast to int
        before being returned.

    Returns
    -------
    normalized_shape : tuple of tuples
        val => ((val, val), (val, val), ...)
        [[val1, val2], [val3, val4], ...] => ((val1, val2), (val3, val4), ...)
        ((val1, val2), (val3, val4), ...) => no change
        [[val1, val2], ] => ((val1, val2), (val1, val2), ...)
        ((val1, val2), ) => ((val1, val2), (val1, val2), ...)
        [[val , ], ] => ((val, val), (val, val), ...)
        ((val , ), ) => ((val, val), (val, val), ...)
    """
    ndims = ndarray.ndim

    # Shortcut shape=None
    if shape is None:
        return ((None, None), ) * ndims

    # Convert any input `info` to a NumPy array
    shape_arr = np.asarray(shape)

    try:
        shape_arr = np.broadcast_to(shape_arr, (ndims, 2))
    except ValueError:
        fmt = "Unable to create correctly shaped tuple from %s"
        raise ValueError(fmt % (shape,))

    # Cast if necessary
    if cast_to_int is True:
        shape_arr = np.round(shape_arr).astype(int)

    # Convert list of lists to tuple of tuples
    return tuple(tuple(axis) for axis in shape_arr.tolist())


def _validate_lengths(narray, number_elements):
    """
    Private function which does some checks and reformats pad_width and
    stat_length using _normalize_shape.

    Parameters
    ----------
    narray : ndarray
        Input ndarray
    number_elements : {sequence, int}, optional
        The width of padding (pad_width) or the number of elements on the edge
        of the narray used for statistics (stat_length).
        ((before_1, after_1), ... (before_N, after_N)) unique number of
        elements for each axis.
        ((before, after),) yields same before and after constants for each
        axis.
        (constant,) or int is a shortcut for before = after = constant for all
        axes.

    Returns
    -------
    _validate_lengths : tuple of tuples
        int => ((int, int), (int, int), ...)
        [[int1, int2], [int3, int4], ...] => ((int1, int2), (int3, int4), ...)
        ((int1, int2), (int3, int4), ...) => no change
        [[int1, int2], ] => ((int1, int2), (int1, int2), ...)
        ((int1, int2), ) => ((int1, int2), (int1, int2), ...)
        [[int , ], ] => ((int, int), (int, int), ...)
        ((int , ), ) => ((int, int), (int, int), ...)
    """
    normshp = _normalize_shape(narray, number_elements)
    for i in normshp:
        chk = [1 if x is None else x for x in i]
        chk = [1 if x >= 0 else -1 for x in chk]
        if (chk[0] < 0) or (chk[1] < 0):
            fmt = "%s cannot contain negative values."
            raise ValueError(fmt % (number_elements,))
    return normshp
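After saving the file and restarting Python, a quick sanity check is to confirm that the missing name can now be imported:
from numpy.lib.arraypad import _validate_lengths
print('patch applied: _validate_lengths is importable')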
scrublet walk-through
My test environment is macOS with jupyter notebook. The following code loads the Python packages and configures plotting:
%matplotlib inline
import scrublet as scr
import scipy.io
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
plt.rcParams['font.family'] = 'sans-serif'
plt.rcParams['font.sans-serif'] = 'Arial'
plt.rc('font', size=14)
plt.rcParams['pdf.fonttype'] = 42
Read in the 10X scRNA-seq matrix. The raw counts are loaded as a scipy sparse matrix with cells as rows and genes as columns:
input_dir = '/Users/yuanzan/Desktop/doublets/filtered_gene_bc_matrices/GRCh38/'
counts_matrix = scipy.io.mmread(input_dir + '/matrix.mtx').T.tocsc()
genes = np.array(scr.load_genes(input_dir + '/genes.tsv', delimiter='\t', column=1))
out_df = pd.read_csv(input_dir + '/barcodes.tsv', header = None, index_col=None, names=['barcode'])
print('Counts matrix shape: {} rows, {} columns'.format(counts_matrix.shape[0], counts_matrix.shape[1]))
print('Number of genes in gene list: {}'.format(len(genes)))
Counts matrix shape: 8381 rows, 33694 columns
Number of genes in gene list: 33694
Initialize the Scrublet object
The relevant parameters are:
- expected_doublet_rate: the expected fraction of doublets, usually 0.05-0.1. The results are not particularly sensitive to this parameter. For this example dataset, the expected doublet rate is taken from the Chromium User Guide.
- sim_doublet_ratio: the number of doublets to simulate, relative to the number of observed transcriptomes. This should be high enough that all doublet states are well represented by the simulated doublets. Setting it too high increases the computation time; the default is 2 (although values as low as 0.5 give very similar results on the datasets tested).
- n_neighbors: the number of neighbors used to build the KNN classifier over the observed transcriptomes and simulated doublets. The default, round(0.5 * sqrt(n_cells)), usually works well.
scrub = scr.Scrublet(counts_matrix, expected_doublet_rate=0.06)
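For reference, all three parameters described above can also be set explicitly; the call below simply restates the defaults discussed in the list (a sketch, not required for this dataset):
n_cells = counts_matrix.shape[0]
scrub = scr.Scrublet(counts_matrix,
                     expected_doublet_rate=0.06,
                     sim_doublet_ratio=2.0,
                     n_neighbors=int(round(0.5 * np.sqrt(n_cells))))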
Compute doublet scores
Run the code below to compute doublet scores. Internally, this performs:
- Doublet simulation
- Normalization, gene filtering, rescaling, PCA
- Doublet score calculation
- Doublet score threshold detection and doublet calling
doublet_scores, predicted_doublets = scrub.scrub_doublets(min_counts=2, min_cells=3, min_gene_variability_pctl=85, n_prin_comps=30)
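Besides the two arrays it returns, the Scrublet object also stores the intermediate results as attributes; the names below are taken from the scrublet source and may differ between versions:
# observed-cell scores, simulated-doublet scores and the automatically detected threshold
print(scrub.doublet_scores_obs_[:5])
print(scrub.doublet_scores_sim_[:5])
print(scrub.threshold_)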
Plot the doublet score histograms
The doublet score histograms cover both the observed transcriptomes and the simulated doublets; the simulated doublet histogram is typically bimodal.
- The left mode corresponds to "embedded" doublets, generated by two cells with similar gene expression profiles.
- The right mode corresponds to "neotypic" doublets, generated by cells with distinct gene expression (e.g., different cell types); these introduce more artifacts in downstream analyses. Scrublet can only detect "neotypic" doublets.
To call doublets vs. singlets, we must set a doublet score threshold, ideally at the minimum between the two modes of the simulated doublet histogram. The scrub_doublets() function tries to identify this point automatically, and it works well on this example dataset. If automatic threshold detection does not perform well, the threshold can be adjusted with the call_doublets() function, for example:
scrub.call_doublets(threshold=0.25)
# plot the doublet score histograms
scrub.plot_histogram()
Dimensionality reduction and visualization
Compute the embedding
This example uses UMAP for dimensionality reduction; t-SNE is also available, but the author does not recommend it because it runs considerably slower.
print('Running UMAP...')
scrub.set_embedding('UMAP', scr.get_umap(scrub.manifold_obs_, 10, min_dist=0.3))
print('Done.')
UMAP visualization
scrub.plot_embedding('UMAP', order_points=True)
In the left panel of the resulting plot, the black points are the predicted doublets.
# fraction of barcodes predicted as doublets
print (scrub.detected_doublet_rate_)
# 0.043789523923159525
Save the doublet predictions to a file; when the data are later processed with Seurat or other tools, these predictions can be imported to filter barcodes.
out_df['doublet_scores'] = doublet_scores
out_df['predicted_doublets'] = predicted_doublets
out_df.to_csv(input_dir + '/doublet.txt', index=False,header=True)
out_df.head()
barcode | doublet_scores | predicted_doublets |
---|---|---|
AAACCTGAGCATCATC-1 | 0.020232985898221900 | FALSE |
AAACCTGAGCTAACTC-1 | 0.009746972531259230 | FALSE |
AAACCTGAGCTAGTGG-1 | 0.013493253373313300 | FALSE |
AAACCTGCACATTAGC-1 | 0.087378640776699 | FALSE |
AAACCTGCACTGTTAG-1 | 0.02405046655276650 | FALSE |
AAACCTGCATAGTAAG-1 | 0.03969184391224250 | FALSE |
AAACCTGCATGAACCT-1 | 0.030082836796977200 | FALSE |
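If you continue in Python instead of Seurat, the predictions can also be used directly to drop the predicted doublets from the counts matrix (a minimal sketch using the objects created above):
singlet_mask = ~np.asarray(predicted_doublets, dtype=bool)  # keep barcodes predicted as singlets
filtered_counts = counts_matrix[singlet_mask, :]
filtered_barcodes = out_df.loc[singlet_mask, 'barcode']
print('Cells kept after doublet removal: {}'.format(filtered_counts.shape[0]))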
The jupyter notebook for this test has been uploaded to my github (seqyuan); feel free to download it and try it yourself.