導(dǎo)言
昨天我們介紹了CITE-seq的測(cè)序原理和簡(jiǎn)單應(yīng)用趟卸。目前很多國(guó)內(nèi)外公司都能做CITE-seq測(cè)序筒严,至少我們知道10x Genomics是可以做的晒哄。有了測(cè)序數(shù)據(jù)(reads文件)寒锚,接下來(lái)我們看看怎么得到蛋白的表達(dá)數(shù)據(jù)。
CITE-seq的研發(fā)者創(chuàng)建的網(wǎng)站上(https://cite-seq.com/computational-tools/)給出了推薦使用的軟件莫辨。
CITE-seq-Count的安裝和使用
CITE-seq-Count是用Python搭建的工具,所以可以用pip安裝:
pip install CITE-seq-Count==1.4.3
How to use it
CITE-seq-Count -R1 TAGS_R1.fastq.gz -R2 TAGS_R2.fastq.gz -t TAG_LIST.csv -cbf X1 -cbl X2 -umif Y1 -umil Y2 -cells EXPECTED_CELLS -o OUTFOLDER
這條命令將對(duì)比對(duì)到抗體DNA標(biāo)簽序列的reads和UMIs進(jìn)行計(jì)數(shù)。
下圖解釋了其中的輸入文件TAGS_R1.fastq.gz和TAGS_R2.fastq.gz預(yù)期的結(jié)構(gòu):
其實(shí)官方文檔已經(jīng)寫的很好了州弟,結(jié)構(gòu)非常清晰依溯,而且英文也更容易幫助我們理解其中一些術(shù)語(yǔ)洪添,所以下面的內(nèi)容我基本都不做翻譯逛尚。這個(gè)工具使用起來(lái)很簡(jiǎn)單觉壶,輸出結(jié)果也完全可以和Seurat無(wú)縫連接。
軟件參數(shù):
INPUT
- [Required] Read1 fastq file location in fastq.gz format. Read 1 typically contains Cell barcode and UMI.
-R1 READ1_PATH.fastq.gz
--read1 READ1_PATH.fastq.gz
- [Required] Read2 fastq file location in fastq.gz. Read 2 typically contains the Antibody barcode.
-R2 READ2_PATH.fastq.gz
--read2 READ2_PATH.fastq.gz
- [Required] The path to the csv file containing the antibody barcodes as well as their respective names. You can run tags of different length together.
-t tags.csv, --tags tags.csv
Antibody barcodes structure:
ATGCGA,First_tag_name
GTCATG,Second_tag_name
GCTAGTCGTACGA,Third_tag_nameGCTAGGTGTCGTA,Forth_tag_name
IMPORTANT: You need to provide only the variable region of the TAG in the tags.csv. Please refer to the following examples.
- CASE1: Legacy barcodes. If you are using barcoes that have a T, C or G plus a polyA tail at the end, the tags.csv file should not contain those additions. Expected barcode:
GCTAGTCGTACGA T AAAAAAAAAA
GCTAGTCGTACGA C AAAAAAAAAA
GCTAGTCGTACGA G AAAAAAAAAA
GCTGTCAGCATAC T AAAAAAAAAA
GCTGTCAGCATAC C AAAAAAAAAAGCTGTCAGCATAC G AAAAAAAAAA
The tags.csv should only contain the part before the T
GCTAGTCGTACGA,tag1GCTGTCAGCATAC,tag2
- CASE2: Constant sequences. If you are using barcoes that have a constant sequence at the end or at the start, the tags.csv file should only contain the variable part. You should also use the
-trim``--start-trim
option to tellCITE-seq-Count
where the variable part of the barcode starts Expected barcode:
CGTAGTCGTAGCTA GCTAGTCGTACGA GCTAGCTGACTCGTAGTCGTAGCTA AACGTAGCTATGT
GCTAGCTGACTCGTAGTCGTAGCTA GCTAGCATATCAG GCTAGCTGACT
The tags.csv should only contain the variable parts and use -trim 14
to trim the first 14 bases.
GCTAGTCGTACGA,tag1
AACGTAGCTATGT,tag2GCTAGCATATCAG,tag3
BARCODING
Positions of the cellular and UMI barcodes.
- [Required] First nucleotide of cell barcode in read 1. For Drop-seq and 10x Genomics this is typically 1.
-cbf CB_FIRST, --cell_barcode_first_base CB_FIRST
- [Required] Last nucleotide of the cell barcode in read 1. For 10x Genomics this is typically 16. For Drop-seq this depends on the bead configuration, typically 12.
-cbl CB_LAST, --cell_barcode_last_base CB_LAST
- [Required] First nucleotide of the UMI in read 1. For 10x Genomics this is typically 17. For Drop-seq this is typically 13.
-umif UMI_FIRST, --umi_first_base UMI_FIRST
- [Required] Last nucleotide of the UMI in read 1. For 10x Genomics this is typically 26. For Drop-seq this is typically 20.
-umil UMI_LAST, --umi_last_base UMI_LAST
Example: Barcodes from 1 to 16 and UMI from 17 to 26, then this is the input you need:
-cbf 1 -cbl 16 -umif 17 -umil 26
- [Optional] How many errors are allowed between two cell barcodes to collapse them onto one cell.
--bc_collapsing_dist N_ERRORS
, default 1
- [Optional] How many errors are allowed between two umi within the same cell and TAG to collapse.
--umi_collapsing_dist N_ERRORS
, default 2
- [Optional] Deactivate UMI correction.
--no_umi_correction
Cells
You have to choose either the number of cells you expect or give it a list of cell barcodes to retrieve.
- [Required] How many cells you expect in your run.
- [Optional] If a whitelist is provided.
-cells EXPECTED_CELLS, --expected_cells EXPECTED_CELLS
- [Optional] Whitelist of cell barcodes provided as a csv file. CITE-seq-Count will search for those barcodes in the data and correct other barcodes based on this list. Will force the output to provide all the barcodes from the whitelist. Please see the guidelines for information regarding specific chemistries.
-wl WHITELIST, --whitelist WHITELIST
Example:
ATGCTAGTGCTAGCTAGTCAGGATCGACTGCTAACG
FILTERING
Filtering for structure of the antibody barcode as well as maximum errors.
- [OPTIONAL] Maximum Levenshtein distance allowed. This allows to catch antibody barcodes that might have
--max-error
errors compared to the real barcodes. (was-hd
in previous versions)
--max-error MAX_ERROR
, default 3
Example:
If we have this kind of antibody barcode:
ATGCCAG
The script will be looking for ATGCCAG in R2
A MAX_ERROR
of 1 will allow barcodes such as ATGTCAG
, having one mismatch to be counted as valid.
There is a sanity check when for the MAX_ERROR
value chosen to be sure you are not allowing too many mismatches and confuse your antibody barcodes. Mismatches on cell or UMI barcodes are discarded.
- [Optional] How many bases should we trim before starting to map. See
CASE2
of special examples in the
-trim N_BASES, --start-trim N_BASES
, default 0
- [OPTIONAL] Activate sliding window alignement. Use this when you have a protocol that has a variable sequence before the inserted TAG.
--sliding-window
, default False
Example:
The TAG: ATGCTAGCT
with a variable prefix: TTCAATTTCA
R2 reads:
TTCA ATGCTAGCTAAAAAAAAAAAAAAAAA
TTCAA ATGCTAGCTAAAAAAAAAAAAAAAA
TTCAAT ATGCTAGCTAAAAAAAAAAAAAAA
TTCAATT ATGCTAGCTAAAAAAAAAAAAAA
TTCAATTT ATGCTAGCTAAAAAAAAAAAAATTCAATTTC ATGCTAGCTAAAAAAAAAAAA
OUTPUT
- [Required] Path to the result folder that will contain both read and umi mtx results as well as a
run_report.yaml
and potential unmapped tags.
-o OUTFOLDER, --output OUTFOLDER
, default Results
- [Optional] Will output the dense umi count matrix in a tsv format in addition to the sparse outputs.
--dense
OPTIONAL
- [Optional] Select first N reads to run on instead of all. This is usefull when trying to test out your parameters before running the whole dataset.
-n FIRST_N, --first_n FIRST_N
- [Optional] How many threads/cores should be used by CITE-seq-Count.
-T N_THREADS, --threads N_THREADS
, default Number of available cores
- [Optional] Output file for unmapped tags. Can be useful to troubelshoot an experiment with low mapping percentage or high "uncorrected cells".
-u OUTFILE, --unmapped-tags OUTFILE
, default unmapped.csv
- [Optional] How many unmapped tags should be written to file
-ut N_UNMAPPED, --unknown-top-tags N_UNMAPPED
, default 50
- [Optional] Print more information about the mapping process. Only use it to find issues. Slows down the processing by a lot.
--debug
軟件輸出結(jié)果
Mtx format
The mtx, matrix market, format is a sparse format for matrices. It only stores non zero values and is becoming popular in single-cell softwares.
The main advantage is that it requires less space than a dense matrix and that you can easily add different feature names within the same object.
For CITE-seq-Count, the output looks like this:
OUTFOLDER/
-- umi_count/
-- -- matrix.mtx.gz
-- -- features.tsv.gz
-- -- barcodes.tsv.gz
-- read_count/
-- -- matrix.mtx.gz
-- -- features.tsv.gz
-- -- barcodes.tsv.gz
-- unmapped.csv
-- run_report.yaml
File descriptions
-
features.tsv.gz
contains the feature names, in this context our tags. -
barcodes.tsv.gz
contains the cell barcodes. -
matrix.mtx.gz
contains the actual values. read_count and umi_count contain respectively the read counts and the collapsed umi counts. For analysis you should use the umi data. The read_count can be used to check if you have an overamplification or oversequencing issue with your protocol. -
unmapped.csv
contains the top N tags that haven't been mapped. -
run_report.yaml
contains the parameters used for the run as well as some statistics. here is an example:
Date: 2019-10-01Running time: 13.86 seconds
CITE-seq-Count Version: 1.4.3
Reads processed: 1000000
Percentage mapped: 33
Percentage unmapped: 67
Uncorrected cells: 0Correction:
Cell barcodes collapsing threshold: 1
Cell barcodes corrected: 57
UMI collapsing threshold: 2
UMIs corrected: 329
Run parameters:
Read1_filename: fastq/test_R1.fastq.gz,fastq/test2_R1.fastq.gz
Read2_filename: fastq/test_R2.fastq.gz,fastq/test2_R2.fastq.gz
Cell barcode:
First position: 1
Last position: 16
UMI barcode:
First position: 17
Last position: 26
Expected cells: 100
Tags max errors: 1
Start trim: 0
Packages to read MTX
R:
I recommend using Seurat
and their Read10x
function to read the results.
With Seurat V3:
Read10x('OUTFOLDER/umi_count/', gene.column=1)
With Matrix:
library(Matrix)
matrix_dir = "/path_to_your_directory/out_cite_seq_count/umi_count/"barcode.path <- paste0(matrix_dir, "barcodes.tsv.gz")
features.path <- paste0(matrix_dir, "features.tsv.gz")
matrix.path <- paste0(matrix_dir, "matrix.mtx.gz")
mat <- readMM(file = matrix.path)
feature.names = read.delim(features.path, header = FALSE, stringsAsFactors = FALSE)
barcode.names = read.delim(barcode.path, header = FALSE, stringsAsFactors = FALSE)
colnames(mat) = barcode.names$V1
rownames(mat) = feature.names$V1
Python:
I recommend using scanpy
and their read_mtx function to read the results.
Example:
import scanpy
import pandas as pd
import os
path = 'umi_cell_corrected'
data = scanpy.read_mtx(os.path.join(path,'umi_count/matrix.mtx.gz'))
data = data.T
features = pd.read_csv(os.path.join(path, 'umi_count/features.tsv.gz'), header=None)
barcodes = pd.read_csv(os.path.join(path, 'umi_count/barcodes.tsv.gz'), header=None)
data.var_names = features[0]
data.obs_names = barcodes[0]
Reference
[1]https://hoohm.github.io/CITE-seq-Count/
歡迎關(guān)注同名公眾號(hào)