Author: Evil Genius
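Single-cell sequencing has become a powerful technique for resolving the heterogeneity of cell populations at different levels, including genetics, transcriptomics, and epigenetics, and it therefore has profound implications for basic research and clinical translation. Cellular genotypes are mainly studied with single-cell DNA-seq (scDNA-seq), which detects somatic mutations in tumors, clusters cells into clones, and infers their evolutionary dynamics. Recently, growing evidence has shown that a subset of somatic mutations, on both the nuclear and the mitochondrial genome, can also be observed with other single-cell assays, including scATAC-seq and full-length scRNA-seq (e.g., SMART-seq2). Germline variants (also known as single-nucleotide polymorphisms, SNPs), on the other hand, are observed far more widely in single-cell sequencing data, even on droplet-based platforms such as 10x Genomics, thanks to the large candidate list [about 7 million SNPs in the human population with a frequency of 0.5%]. Germline SNPs are not only perfect natural barcodes when multiplexing cells from multiple individuals; they also matter for inferring functional regulation, either through cellular eQTL analysis or through allelic imbalance caused by allele-specific expression and copy number variation.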
Tool overview
Cellsnp-lite is implemented in C/C++ and performs per-cell genotyping, both with (mode 1) and without (mode 2) given SNPs. In the latter case, heterozygous SNPs are detected automatically. Cellsnp-lite works on both droplet-based platforms (e.g., 10x Genomics data) and well-based platforms (e.g., SMART-seq2 data).
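Cellsnp-lite takes aligned reads in BAM/SAM/CRAM format as input. Cell labels can either be encoded in the cell tag of a pooled (multiplexed) BAM file (droplet-based platforms) or be specified by one BAM file per cell (well-based platforms). This flexibility also lets cellsnp-lite work seamlessly on bulk samples, e.g., bulk RNA-seq, simply by treating each sample as a well-based "cell".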
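The pileup is performed at each genomic position, either for the given SNPs (mode 1) or across whole chromosomes (mode 2). All reads covering a query position are fetched. By default, reads with low alignment quality are discarded: those with MAPQ < 20, an aligned length < 30 nt, or a FLAG containing UNMAP, SECONDARY, or QCFAIL (and DUP, if UMIs are not applicable). Then, for droplet-based samples (mode 1a or 2a), the remaining reads are assigned to cells through a hash map; for well-based cells (mode 1b or 2b) they are assigned directly. Within each cell, UMIs (if present) or reads are counted for each of the bases A, C, G, T, and N. If SNPs are given (mode 1), the REF and ALT alleles are taken from the input SNPs; otherwise the base with the highest count is chosen as REF and the base with the second highest count as ALT (mode 2).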
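The filtering and counting steps above can be sketched in Python as follows. This is only an illustration of the described logic, not cellsnp-lite's actual C implementation. The read dictionaries and their keys (`flag`, `mapq`, `aligned_len`, `cb`, `ub`, `base`) and all function names are hypothetical. Keeping only the first read per (cell, UMI) pair, excluding N from the REF/ALT choice, and breaking ties alphabetically are simplifying assumptions.

```python
from collections import Counter, defaultdict

# SAM FLAG bits as defined in the SAM specification.
UNMAP, SECONDARY, QCFAIL, DUP = 0x4, 0x100, 0x200, 0x400


def passes_filter(read, use_umi, min_mapq=20, min_len=30):
    """Return True if a read survives the default quality filters."""
    excl = UNMAP | SECONDARY | QCFAIL
    if not use_umi:
        excl |= DUP  # duplicates are excluded only when UMIs are not used
    if read["flag"] & excl:
        return False
    return read["mapq"] >= min_mapq and read["aligned_len"] >= min_len


def count_bases(reads, barcodes, use_umi=True):
    """Count bases per cell at one position, once per UMI if UMIs are used."""
    valid = set(barcodes)  # hash-based lookup of known cell barcodes
    seen = set()           # (cell, UMI) pairs that were already counted
    counts = defaultdict(Counter)
    for read in reads:
        if not passes_filter(read, use_umi):
            continue
        cell = read["cb"]
        if cell not in valid:
            continue
        if use_umi:
            key = (cell, read["ub"])
            if key in seen:
                continue  # assumption: the first read of a UMI represents it
            seen.add(key)
        counts[cell][read["base"]] += 1
    return counts


def pick_ref_alt(counts):
    """Mode 2 style choice: most frequent base is REF, second is ALT."""
    total = Counter()
    for per_cell in counts.values():
        total.update(per_cell)
    ranked = sorted("ACGT", key=lambda b: (-total[b], b))
    return ranked[0], ranked[1]
```

In mode 1 the last step would be skipped, because REF and ALT come from the input VCF.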
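When SNPs are given (mode 1), cellsnp-lite computes in parallel by splitting the input SNPs, in order, into equal parts across multiple threads. Otherwise, in mode 2, cellsnp-lite parallelizes by splitting the listed chromosomes, with one thread per chromosome.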
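The ordered, even split described above can be illustrated with a small Python sketch. The function name `split_evenly` is hypothetical, and how cellsnp-lite handles a remainder internally is not stated in the text; here the first chunks simply receive one extra item.

```python
def split_evenly(items, n_threads):
    """Split items into contiguous chunks of near-equal size, keeping order."""
    n = len(items)
    n_threads = max(1, min(n_threads, n))  # never create more chunks than items
    base, extra = divmod(n, n_threads)
    chunks, start = [], 0
    for i in range(n_threads):
        size = base + (1 if i < extra else 0)  # spread the remainder
        chunks.append(items[start:start + size])
        start += size
    return chunks
```

For mode 2, the same idea degenerates to one chunk per chromosome, e.g. `split_evenly(chroms, len(chroms))`.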
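In all of the cases above, cellsnp-lite outputs sparse matrices of the alternative allele counts, the depth (i.e., REF plus ALT alleles), and the counts of the other alleles. If the --genotype option is added, cellsnp-lite performs genotyping with the error model shown in Table 1 and writes the result in VCF format, with cells as samples.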
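To give a feel for what genotyping from allele counts involves, here is a minimal Python sketch using a textbook binomial model with a fixed sequencing error rate. This is not the error model of Table 1, which is not reproduced in this post; the function names and the `error_rate` default are assumptions made purely for illustration.

```python
import math


def genotype_log_likelihoods(ad, dp, error_rate=0.01):
    """Binomial log-likelihoods of 0/0, 0/1 and 1/1 given ALT (ad) and total (dp) counts.

    The binomial coefficient is omitted because it is the same for all genotypes.
    """
    alt_prob = {"0/0": error_rate, "0/1": 0.5, "1/1": 1.0 - error_rate}
    ref = dp - ad
    return {
        gt: ad * math.log(p) + ref * math.log(1.0 - p)
        for gt, p in alt_prob.items()
    }


def call_genotype(ad, dp, error_rate=0.01):
    """Return the most likely genotype, or None when there is no coverage."""
    if dp == 0:
        return None
    ll = genotype_log_likelihoods(ad, dp, error_rate)
    return max(ll, key=ll.get)
```

With sparse single-cell coverage many cells have very low depth, so the likelihoods themselves are usually more informative than a hard call.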
Installation
conda install -c bioconda cellsnp-lite
Usage
Mode 1: pileup with given SNPs
Mode 1a: droplet-based single cells
Require: a single BAM/SAM/CRAM file, e.g., from CellRanger; a list of cell barcodes, e.g., the barcodes.tsv file in the CellRanger directory outs/filtered_gene_bc_matrices/; and a VCF file of common SNPs. This mode is recommended over mode 2 if a list of common SNPs is known, e.g., for human (see Candidate_SNPs).
cellsnp-lite -s $BAM -b $BARCODE -O $OUT_DIR -R $REGION_VCF -p 20 --minMAF 0.1 --minCOUNT 20 --gzip
As shown in the command above, we recommend filtering out SNPs with <20 UMIs or <10% minor allele frequency for downstream donor deconvolution, by adding --minMAF 0.1 --minCOUNT 20.
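The effect of the two thresholds can be sketched in Python as below. The function `keep_snp` is hypothetical, and the exact definitions are assumptions: the aggregated count is taken as the REF + ALT depth summed over all cells, and the minor allele frequency as the smaller of the ALT and REF fractions. cellsnp-lite's own bookkeeping may differ in detail.

```python
def keep_snp(ad_per_cell, dp_per_cell, min_count=20, min_maf=0.1):
    """Decide whether a SNP passes the aggregated count and MAF thresholds."""
    total_dp = sum(dp_per_cell)  # REF + ALT counts over all cells
    if total_dp == 0 or total_dp < min_count:
        return False
    total_ad = sum(ad_per_cell)  # ALT counts over all cells
    minor = min(total_ad, total_dp - total_ad)
    return minor / total_dp >= min_maf
```

A SNP with 25 UMIs of which 5 support ALT passes (MAF 0.2), whereas one with 25 UMIs and a single ALT UMI does not (MAF 0.04).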
In addition, take special care when filtering PCR duplicates in scRNA-seq data by including the DUP bit in exclFLAG: the upstream pipeline may mark every extra read sharing the same CB/UMI pair as a PCR duplicate, so excluding DUP reads would discard most of the variant data. For this reason, cellsnp-lite by default uses an exclFLAG value without DUP, so that PCR duplicates are included for scRNA-seq data when UMItag is turned on.
Mode 1b: well-based single cells or bulk
Require: one or multiple BAM/SAM/CRAM files (bulk or SMART-seq), their corresponding sample IDs (optional), and a VCF file listing common SNPs. BAM/SAM/CRAM files can be given as a comma-separated list (-s) or in a list file (-S).
cellsnp-lite -s $BAM1,$BAM2 -I sample_id1,sample_id2 -O $OUT_DIR -R $REGION_VCF -p 20 --cellTAG None --UMItag None --gzip
cellsnp-lite -S $BAM_list_file -i sample_list_file -O $OUT_DIR -R $REGION_VCF -p 20 --cellTAG None --UMItag None --gzip
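For many per-cell BAM files, the two list files passed to -S and -i can be generated rather than typed. The sketch below is a hypothetical helper, not part of cellsnp-lite; it assumes that both files hold one entry per line in matching order (as the option descriptions state) and that using the file name without its extension as the sample ID is acceptable.

```python
from pathlib import Path


def build_list_files(bam_paths):
    """Return the text of a BAM list file and a matching sample ID list file."""
    sample_ids = [Path(p).stem for p in bam_paths]  # e.g. cellA.bam -> cellA
    if len(set(sample_ids)) != len(sample_ids):
        raise ValueError("sample IDs derived from file names are not unique")
    bam_text = "\n".join(str(p) for p in bam_paths) + "\n"
    sample_text = "\n".join(sample_ids) + "\n"
    return bam_text, sample_text
```

The two strings can then be written out with `Path("bam_list.txt").write_text(bam_text)` and `Path("sample_list.txt").write_text(sample_text)`; the file names are arbitrary.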
Mode 2: pileup whole chromosome(s) without given SNPs
For mode 2, cellsnp-lite runs on human chr1 to chr22 by default. For mouse, you need to set --chrom to 1,2,…,19 (replace the ellipsis).
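Rather than typing the chromosome list by hand, the value for --chrom can be generated. The helper below is a hypothetical convenience; whether names need a `chr` prefix depends on how the contigs are named in your BAM header, which is an assumption you should check.

```python
def chrom_arg(first, last, prefix=""):
    """Build a comma-separated chromosome list, e.g. for --chrom."""
    return ",".join(f"{prefix}{i}" for i in range(first, last + 1))


# Mouse autosomes 1 to 19, without a "chr" prefix.
mouse_autosomes = chrom_arg(1, 19)
```

The resulting string can be pasted after --chrom on the command line.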
This mode may output false positive SNPs, for example somatic variants or false positives caused by RNA editing. Such false SNPs are probably not consistent across all cells of one individual and can therefore confound demultiplexing. Nevertheless, for species without a good list of common SNPs, e.g., zebrafish, this strategy is still worth a try.
Mode 2a: droplet based single cells without given SNPs
# 10x sample with cell barcodes
cellsnp-lite -s $BAM -b $BARCODE -O $OUT_DIR -p 22 --minMAF 0.1 --minCOUNT 100 --gzip
Add --chrom if you only want to genotype specific chromosomes, e.g., 1,2, or chrMT.
Mode 2b: well-based single cells or bulk without SNPs
# a bulk sample without cell barcodes and UMI tag
cellsnp-lite -s $bulkBAM -I Sample0 -O $OUT_DIR -p 22 --minMAF 0.1 --minCOUNT 100 --cellTAG None --UMItag None --gzip
Output
cellsnp-lite outputs at least 5 files listed below (with --gzip):
cellSNP.base.vcf.gz: a VCF file listing genotyped SNPs and aggregated AD & DP information (without GT).
cellSNP.samples.tsv: a TSV file listing cell barcodes or sample IDs.
cellSNP.tag.AD.mtx: a file in "Matrix Market exchange formats", containing the allele depths of the alternative (ALT) alleles.
cellSNP.tag.DP.mtx: a file in "Matrix Market exchange formats", containing the sum of allele depths of the reference and alternative alleles (REF + ALT).
cellSNP.tag.OTH.mtx: a file in "Matrix Market exchange formats", containing the sum of allele depths of all the alleles other than REF and ALT.
If the --genotype option is specified, cellsnp-lite also outputs the cellSNP.cells.vcf.gz file, a VCF file listing genotyped SNPs and AD & DP & genotype (GT) information for each cell or sample.
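As a rough illustration of how the AD and DP matrices fit together, the Python sketch below parses coordinate-format Matrix Market text and computes the ALT allele fraction per entry. It is a minimal stand-in written for this post: the function names are hypothetical, only integer coordinate matrices are handled, and the reading of rows as SNPs and columns as cells is an assumption.

```python
def read_mtx(lines):
    """Parse coordinate-format Matrix Market lines into (shape, {(row, col): value})."""
    body = [ln for ln in lines if ln.strip() and not ln.startswith("%")]
    n_rows, n_cols, n_entries = (int(x) for x in body[0].split())
    entries = {}
    for ln in body[1:]:
        row, col, value = ln.split()
        entries[(int(row), int(col))] = int(value)  # indices are 1-based
    if len(entries) != n_entries:
        raise ValueError("entry count does not match the size line")
    return (n_rows, n_cols), entries


def alt_fraction(ad, dp):
    """ALT allele fraction AD / DP for every covered (row, col) entry."""
    return {key: ad.get(key, 0) / depth for key, depth in dp.items() if depth > 0}
```

For real data, `scipy.io.mmread` reads the same files into a sparse matrix and is the more practical choice; after decompressing the output, `read_mtx(open("cellSNP.tag.AD.mtx"))` would work with this sketch.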
Full parameters
Usage: cellsnp-lite [options]
Options:
  -s, --samFile STR       Indexed sam/bam file(s), comma separated multiple samples.
                          Mode 1a & 2a: one sam/bam file with single cell.
                          Mode 1b & 2b: one or multiple bulk sam/bam files,
                          no barcodes needed, but sample ids and regionsVCF.
  -S, --samFileList FILE  A list file containing bam files, each per line, for Mode 1b & 2b.
  -O, --outDir DIR        Output directory for VCF and sparse matrices.
  -R, --regionsVCF FILE   A vcf file listing all candidate SNPs, for fetch each variants.
                          If None, pileup the genome. Needed for bulk samples.
  -T, --targetsVCF FILE   Similar as -R, but the next position is accessed by streaming rather
                          than indexing/jumping (like -T in samtools/bcftools mpileup).
  -b, --barcodeFile FILE  A plain file listing all effective cell barcode.
  -i, --sampleList FILE   A list file containing sample IDs, each per line.
  -I, --sampleIDs STR     Comma separated sample ids.
  -V, --version           Print software version and exit.
  -h, --help              Show this help message and exit.
Optional arguments:
  --genotype              If use, do genotyping in addition to counting.
  --gzip                  If use, the output files will be zipped into BGZF format.
  --printSkipSNPs         If use, the SNPs skipped when loading VCF will be printed.
  -p, --nproc INT         Number of subprocesses [1]
  -f, --refseq FILE       Faidx indexed reference sequence file. If set, the real (genomic)
                          ref extracted from this file would be used for Mode 2 or for the
                          missing REFs in the input VCF for Mode 1.
  --chrom STR             The chromosomes to use, comma separated [1 to 22]
  --cellTAG STR           Tag for cell barcodes, turn off with None [CB]
  --UMItag STR            Tag for UMI: UB, Auto, None. For Auto mode, use UB if barcodes are inputted,
                          otherwise use None. None mode means no UMI but read counts [Auto]
  --minCOUNT INT          Minimum aggragated count [20]
  --minMAF FLOAT          Minimum minor allele frequency [0.00]
  --doubletGL             If use, keep doublet GT likelihood, i.e., GT=0.5 and GT=1.5.
Read filtering:
  --inclFLAG STR|INT      Required flags: skip reads with all mask bits unset []
  --exclFLAG STR|INT      Filter flags: skip reads with any mask bits set [UNMAP,SECONDARY,QCFAIL
                          (when use UMI) or UNMAP,SECONDARY,QCFAIL,DUP (otherwise)]
  --minLEN INT            Minimum mapped length for read filtering [30]
  --minMAPQ INT           Minimum MAPQ for read filtering [20]
  --maxPILEUP INT         Maximum pileup for one site of one file (including those filtered reads),
                          avoids excessive memory usage; 0 means highest possible value [0]
  --maxDEPTH INT          Maximum depth for one site of one file (excluding those filtered reads),
                          avoids excessive memory usage; 0 means highest possible value [0]
  --countORPHAN           If use, do not skip anomalous read pairs.
Note that the "--maxFLAG" option is now deprecated, please use "--inclFLAG" or "--exclFLAG"
instead. You can easily aggregate and convert the flag mask bits to an integer by refering to:
https://broadinstitute.github.io/picard/explain-flags.html
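The conversion from flag names to an integer mask mentioned in the note above can also be done locally. The sketch below uses the bit values defined in the SAM specification; the helper name `flag_mask` is hypothetical and not part of cellsnp-lite.

```python
# FLAG bits from the SAM specification, using samtools-style names.
SAM_FLAGS = {
    "PAIRED": 0x1, "PROPER_PAIR": 0x2, "UNMAP": 0x4, "MUNMAP": 0x8,
    "REVERSE": 0x10, "MREVERSE": 0x20, "READ1": 0x40, "READ2": 0x80,
    "SECONDARY": 0x100, "QCFAIL": 0x200, "DUP": 0x400, "SUPPLEMENTARY": 0x800,
}


def flag_mask(names):
    """Combine comma-separated flag names into one integer bit mask."""
    mask = 0
    for name in names.split(","):
        mask |= SAM_FLAGS[name.strip().upper()]
    return mask


print(flag_mask("UNMAP,SECONDARY,QCFAIL"))      # 772
print(flag_mask("UNMAP,SECONDARY,QCFAIL,DUP"))  # 1796
```

The two printed values correspond to the default exclFLAG settings with and without UMIs, respectively.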
Life is good, and even better with you.