【單細(xì)胞組學(xué)】CITE-seq-Count:分析CITEseq測(cè)序數(shù)據(jù)

導(dǎo)言

昨天我們介紹了CITE-seq的測(cè)序原理和簡(jiǎn)單應(yīng)用趟卸。目前很多國(guó)內(nèi)外公司都能做CITE-seq測(cè)序筒严,至少我們知道10x Genomics是可以做的晒哄。有了測(cè)序數(shù)據(jù)(reads文件)寒锚,接下來(lái)我們看看怎么得到蛋白的表達(dá)數(shù)據(jù)。

CITE-seq的研發(fā)者創(chuàng)建的網(wǎng)站上(https://cite-seq.com/computational-tools/)給出了推薦使用的軟件莫辨。

對(duì)于read-level數(shù)據(jù)可以使用CITE-seq-Count來(lái)獲得count 矩陣

CITE-seq-Count的安裝和使用

CITE-seq-Count是用Python搭建的工具,所以可以用pip安裝:

pip install CITE-seq-Count==1.4.3

How to use it

CITE-seq-Count -R1 TAGS_R1.fastq.gz -R2 TAGS_R2.fastq.gz -t TAG_LIST.csv -cbf X1 -cbl X2 -umif Y1 -umil Y2 -cells EXPECTED_CELLS -o OUTFOLDER

這條命令將對(duì)比對(duì)到抗體DNA標(biāo)簽序列的reads和UMIs進(jìn)行計(jì)數(shù)。

下圖解釋了其中的輸入文件TAGS_R1.fastq.gz和TAGS_R2.fastq.gz預(yù)期的結(jié)構(gòu):

其實(shí)官方文檔已經(jīng)寫的很好了州弟,結(jié)構(gòu)非常清晰依溯,而且英文也更容易幫助我們理解其中一些術(shù)語(yǔ)洪添,所以下面的內(nèi)容我基本都不做翻譯逛尚。這個(gè)工具使用起來(lái)很簡(jiǎn)單觉壶,輸出結(jié)果也完全可以和Seurat無(wú)縫連接。

軟件參數(shù):

INPUT

  • [Required] Read1 fastq file location in fastq.gz format. Read 1 typically contains Cell barcode and UMI.

-R1 READ1_PATH.fastq.gz

--read1 READ1_PATH.fastq.gz

  • [Required] Read2 fastq file location in fastq.gz. Read 2 typically contains the Antibody barcode.

-R2 READ2_PATH.fastq.gz

--read2 READ2_PATH.fastq.gz

  • [Required] The path to the csv file containing the antibody barcodes as well as their respective names. You can run tags of different length together.

-t tags.csv, --tags tags.csv

Antibody barcodes structure:

ATGCGA,First_tag_name
GTCATG,Second_tag_name
GCTAGTCGTACGA,Third_tag_nameGCTAGGTGTCGTA,Forth_tag_name

IMPORTANT: You need to provide only the variable region of the TAG in the tags.csv. Please refer to the following examples.

  • CASE1: Legacy barcodes. If you are using barcoes that have a T, C or G plus a polyA tail at the end, the tags.csv file should not contain those additions. Expected barcode:
GCTAGTCGTACGA T AAAAAAAAAA
GCTAGTCGTACGA C AAAAAAAAAA
GCTAGTCGTACGA G AAAAAAAAAA
GCTGTCAGCATAC T AAAAAAAAAA
GCTGTCAGCATAC C AAAAAAAAAAGCTGTCAGCATAC G AAAAAAAAAA

The tags.csv should only contain the part before the T

GCTAGTCGTACGA,tag1GCTGTCAGCATAC,tag2

  • CASE2: Constant sequences. If you are using barcoes that have a constant sequence at the end or at the start, the tags.csv file should only contain the variable part. You should also use the -trim``--start-trim option to tell CITE-seq-Count where the variable part of the barcode starts Expected barcode:
CGTAGTCGTAGCTA GCTAGTCGTACGA GCTAGCTGACTCGTAGTCGTAGCTA AACGTAGCTATGT 
GCTAGCTGACTCGTAGTCGTAGCTA GCTAGCATATCAG GCTAGCTGACT

The tags.csv should only contain the variable parts and use -trim 14 to trim the first 14 bases.

GCTAGTCGTACGA,tag1
AACGTAGCTATGT,tag2GCTAGCATATCAG,tag3

BARCODING

Positions of the cellular and UMI barcodes.

  • [Required] First nucleotide of cell barcode in read 1. For Drop-seq and 10x Genomics this is typically 1.

-cbf CB_FIRST, --cell_barcode_first_base CB_FIRST

  • [Required] Last nucleotide of the cell barcode in read 1. For 10x Genomics this is typically 16. For Drop-seq this depends on the bead configuration, typically 12.

-cbl CB_LAST, --cell_barcode_last_base CB_LAST

  • [Required] First nucleotide of the UMI in read 1. For 10x Genomics this is typically 17. For Drop-seq this is typically 13.

-umif UMI_FIRST, --umi_first_base UMI_FIRST

  • [Required] Last nucleotide of the UMI in read 1. For 10x Genomics this is typically 26. For Drop-seq this is typically 20.

-umil UMI_LAST, --umi_last_base UMI_LASTExample: Barcodes from 1 to 16 and UMI from 17 to 26, then this is the input you need:

-cbf 1 -cbl 16 -umif 17 -umil 26

  • [Optional] How many errors are allowed between two cell barcodes to collapse them onto one cell.

--bc_collapsing_dist N_ERRORS, default 1

  • [Optional] How many errors are allowed between two umi within the same cell and TAG to collapse.

--umi_collapsing_dist N_ERRORS, default 2

  • [Optional] Deactivate UMI correction.

--no_umi_correction

Cells

You have to choose either the number of cells you expect or give it a list of cell barcodes to retrieve.

  • [Required] How many cells you expect in your run.
  • [Optional] If a whitelist is provided.

-cells EXPECTED_CELLS, --expected_cells EXPECTED_CELLS

  • [Optional] Whitelist of cell barcodes provided as a csv file. CITE-seq-Count will search for those barcodes in the data and correct other barcodes based on this list. Will force the output to provide all the barcodes from the whitelist. Please see the guidelines for information regarding specific chemistries.

-wl WHITELIST, --whitelist WHITELISTExample:

ATGCTAGTGCTAGCTAGTCAGGATCGACTGCTAACG

FILTERING

Filtering for structure of the antibody barcode as well as maximum errors.

  • [OPTIONAL] Maximum Levenshtein distance allowed. This allows to catch antibody barcodes that might have --max-error errors compared to the real barcodes. (was -hd in previous versions)

--max-error MAX_ERROR, default 3Example:

If we have this kind of antibody barcode:

ATGCCAGThe script will be looking for ATGCCAG in R2

A MAX_ERROR of 1 will allow barcodes such as ATGTCAG, having one mismatch to be counted as valid.

There is a sanity check when for the MAX_ERROR value chosen to be sure you are not allowing too many mismatches and confuse your antibody barcodes. Mismatches on cell or UMI barcodes are discarded.

  • [Optional] How many bases should we trim before starting to map. See CASE2 of special examples in the

-trim N_BASES, --start-trim N_BASES, default 0

  • [OPTIONAL] Activate sliding window alignement. Use this when you have a protocol that has a variable sequence before the inserted TAG.

--sliding-window, default FalseExample:

The TAG: ATGCTAGCT with a variable prefix: TTCAATTTCA R2 reads:

TTCA ATGCTAGCTAAAAAAAAAAAAAAAAA
TTCAA ATGCTAGCTAAAAAAAAAAAAAAAA
TTCAAT ATGCTAGCTAAAAAAAAAAAAAAA
TTCAATT ATGCTAGCTAAAAAAAAAAAAAA
TTCAATTT ATGCTAGCTAAAAAAAAAAAAATTCAATTTC ATGCTAGCTAAAAAAAAAAAA

OUTPUT

  • [Required] Path to the result folder that will contain both read and umi mtx results as well as a run_report.yaml and potential unmapped tags.

-o OUTFOLDER, --output OUTFOLDER, default Results

  • [Optional] Will output the dense umi count matrix in a tsv format in addition to the sparse outputs.

--dense

OPTIONAL

  • [Optional] Select first N reads to run on instead of all. This is usefull when trying to test out your parameters before running the whole dataset.

-n FIRST_N, --first_n FIRST_N

  • [Optional] How many threads/cores should be used by CITE-seq-Count.

-T N_THREADS, --threads N_THREADS, default Number of available cores

  • [Optional] Output file for unmapped tags. Can be useful to troubelshoot an experiment with low mapping percentage or high "uncorrected cells".

-u OUTFILE, --unmapped-tags OUTFILE, default unmapped.csv

  • [Optional] How many unmapped tags should be written to file

-ut N_UNMAPPED, --unknown-top-tags N_UNMAPPED, default 50

  • [Optional] Print more information about the mapping process. Only use it to find issues. Slows down the processing by a lot.

--debug

軟件輸出結(jié)果

Mtx format

The mtx, matrix market, format is a sparse format for matrices. It only stores non zero values and is becoming popular in single-cell softwares.

The main advantage is that it requires less space than a dense matrix and that you can easily add different feature names within the same object.

For CITE-seq-Count, the output looks like this:

OUTFOLDER/
-- umi_count/
-- -- matrix.mtx.gz
-- -- features.tsv.gz
-- -- barcodes.tsv.gz
-- read_count/
-- -- matrix.mtx.gz
-- -- features.tsv.gz
-- -- barcodes.tsv.gz
-- unmapped.csv
-- run_report.yaml

File descriptions

  • features.tsv.gz contains the feature names, in this context our tags.
  • barcodes.tsv.gz contains the cell barcodes.
  • matrix.mtx.gz contains the actual values. read_count and umi_count contain respectively the read counts and the collapsed umi counts. For analysis you should use the umi data. The read_count can be used to check if you have an overamplification or oversequencing issue with your protocol.
  • unmapped.csv contains the top N tags that haven't been mapped.
  • run_report.yaml contains the parameters used for the run as well as some statistics. here is an example:
Date: 2019-10-01Running time: 13.86 seconds
CITE-seq-Count Version: 1.4.3
Reads processed: 1000000
Percentage mapped: 33
Percentage unmapped: 67
Uncorrected cells: 0Correction:
    Cell barcodes collapsing threshold: 1
    Cell barcodes corrected: 57
    UMI collapsing threshold: 2
    UMIs corrected: 329
Run parameters:
    Read1_filename: fastq/test_R1.fastq.gz,fastq/test2_R1.fastq.gz
    Read2_filename: fastq/test_R2.fastq.gz,fastq/test2_R2.fastq.gz
    Cell barcode:
        First position: 1
        Last position: 16
    UMI barcode:
        First position: 17
        Last position: 26
    Expected cells: 100
    Tags max errors: 1
    Start trim: 0

Packages to read MTX

R:

I recommend using Seurat and their Read10x function to read the results.

With Seurat V3:

Read10x('OUTFOLDER/umi_count/', gene.column=1)

With Matrix:

library(Matrix)
matrix_dir = "/path_to_your_directory/out_cite_seq_count/umi_count/"barcode.path <- paste0(matrix_dir, "barcodes.tsv.gz")
features.path <- paste0(matrix_dir, "features.tsv.gz")
matrix.path <- paste0(matrix_dir, "matrix.mtx.gz")
mat <- readMM(file = matrix.path)
feature.names = read.delim(features.path, header = FALSE, stringsAsFactors = FALSE)
barcode.names = read.delim(barcode.path, header = FALSE, stringsAsFactors = FALSE)
colnames(mat) = barcode.names$V1
rownames(mat) = feature.names$V1

Python:

I recommend using scanpy and their read_mtx function to read the results.

Example:

import scanpy
import pandas as pd
import os
path = 'umi_cell_corrected'
data = scanpy.read_mtx(os.path.join(path,'umi_count/matrix.mtx.gz'))
data = data.T
features = pd.read_csv(os.path.join(path, 'umi_count/features.tsv.gz'), header=None)
barcodes = pd.read_csv(os.path.join(path, 'umi_count/barcodes.tsv.gz'), header=None)
data.var_names = features[0]
data.obs_names = barcodes[0]

Reference

[1]https://hoohm.github.io/CITE-seq-Count/

歡迎關(guān)注同名公眾號(hào)

?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請(qǐng)聯(lián)系作者
  • 序言:七十年代末,一起剝皮案震驚了整個(gè)濱河市橱赠,隨后出現(xiàn)的幾起案子尤仍,更是在濱河造成了極大的恐慌,老刑警劉巖狭姨,帶你破解...
    沈念sama閱讀 206,126評(píng)論 6 481
  • 序言:濱河連續(xù)發(fā)生了三起死亡事件宰啦,死亡現(xiàn)場(chǎng)離奇詭異,居然都是意外死亡饼拍,警方通過(guò)查閱死者的電腦和手機(jī)赡模,發(fā)現(xiàn)死者居然都...
    沈念sama閱讀 88,254評(píng)論 2 382
  • 文/潘曉璐 我一進(jìn)店門,熙熙樓的掌柜王于貴愁眉苦臉地迎上來(lái)师抄,“玉大人漓柑,你說(shuō)我怎么就攤上這事∵端保” “怎么了辆布?”我有些...
    開(kāi)封第一講書人閱讀 152,445評(píng)論 0 341
  • 文/不壞的土叔 我叫張陵,是天一觀的道長(zhǎng)茶鉴。 經(jīng)常有香客問(wèn)我锋玲,道長(zhǎng),這世上最難降的妖魔是什么涵叮? 我笑而不...
    開(kāi)封第一講書人閱讀 55,185評(píng)論 1 278
  • 正文 為了忘掉前任惭蹂,我火速辦了婚禮,結(jié)果婚禮上围肥,老公的妹妹穿的比我還像新娘剿干。我一直安慰自己,他們只是感情好穆刻,可當(dāng)我...
    茶點(diǎn)故事閱讀 64,178評(píng)論 5 371
  • 文/花漫 我一把揭開(kāi)白布置尔。 她就那樣靜靜地躺著,像睡著了一般氢伟。 火紅的嫁衣襯著肌膚如雪榜轿。 梳的紋絲不亂的頭發(fā)上幽歼,一...
    開(kāi)封第一講書人閱讀 48,970評(píng)論 1 284
  • 那天,我揣著相機(jī)與錄音谬盐,去河邊找鬼甸私。 笑死,一個(gè)胖子當(dāng)著我的面吹牛飞傀,可吹牛的內(nèi)容都是我干的皇型。 我是一名探鬼主播,決...
    沈念sama閱讀 38,276評(píng)論 3 399
  • 文/蒼蘭香墨 我猛地睜開(kāi)眼砸烦,長(zhǎng)吁一口氣:“原來(lái)是場(chǎng)噩夢(mèng)啊……” “哼弃鸦!你這毒婦竟也來(lái)了?” 一聲冷哼從身側(cè)響起幢痘,我...
    開(kāi)封第一講書人閱讀 36,927評(píng)論 0 259
  • 序言:老撾萬(wàn)榮一對(duì)情侶失蹤唬格,失蹤者是張志新(化名)和其女友劉穎,沒(méi)想到半個(gè)月后颜说,有當(dāng)?shù)厝嗽跇?shù)林里發(fā)現(xiàn)了一具尸體购岗,經(jīng)...
    沈念sama閱讀 43,400評(píng)論 1 300
  • 正文 獨(dú)居荒郊野嶺守林人離奇死亡,尸身上長(zhǎng)有42處帶血的膿包…… 初始之章·張勛 以下內(nèi)容為張勛視角 年9月15日...
    茶點(diǎn)故事閱讀 35,883評(píng)論 2 323
  • 正文 我和宋清朗相戀三年门粪,在試婚紗的時(shí)候發(fā)現(xiàn)自己被綠了喊积。 大學(xué)時(shí)的朋友給我發(fā)了我未婚夫和他白月光在一起吃飯的照片。...
    茶點(diǎn)故事閱讀 37,997評(píng)論 1 333
  • 序言:一個(gè)原本活蹦亂跳的男人離奇死亡庄拇,死狀恐怖注服,靈堂內(nèi)的尸體忽然破棺而出,到底是詐尸還是另有隱情措近,我是刑警寧澤,帶...
    沈念sama閱讀 33,646評(píng)論 4 322
  • 正文 年R本政府宣布女淑,位于F島的核電站瞭郑,受9級(jí)特大地震影響,放射性物質(zhì)發(fā)生泄漏鸭你。R本人自食惡果不足惜屈张,卻給世界環(huán)境...
    茶點(diǎn)故事閱讀 39,213評(píng)論 3 307
  • 文/蒙蒙 一、第九天 我趴在偏房一處隱蔽的房頂上張望袱巨。 院中可真熱鬧阁谆,春花似錦、人聲如沸愉老。這莊子的主人今日做“春日...
    開(kāi)封第一講書人閱讀 30,204評(píng)論 0 19
  • 文/蒼蘭香墨 我抬頭看了看天上的太陽(yáng)嫉入。三九已至焰盗,卻和暖如春璧尸,著一層夾襖步出監(jiān)牢的瞬間,已是汗流浹背熬拒。 一陣腳步聲響...
    開(kāi)封第一講書人閱讀 31,423評(píng)論 1 260
  • 我被黑心中介騙來(lái)泰國(guó)打工爷光, 沒(méi)想到剛下飛機(jī)就差點(diǎn)兒被人妖公主榨干…… 1. 我叫王不留,地道東北人澎粟。 一個(gè)月前我還...
    沈念sama閱讀 45,423評(píng)論 2 352
  • 正文 我出身青樓蛀序,卻偏偏與公主長(zhǎng)得像,于是被迫代替她去往敵國(guó)和親活烙。 傳聞我的和親對(duì)象是個(gè)殘疾皇子徐裸,可洞房花燭夜當(dāng)晚...
    茶點(diǎn)故事閱讀 42,722評(píng)論 2 345

推薦閱讀更多精彩內(nèi)容