10X單細(xì)胞空間數(shù)據(jù)分析之SNP檢測(cè)篇

作者垂蜗,Evil Genius

單細(xì)胞測(cè)序已成為在遺傳學(xué)德挣、轉(zhuǎn)錄組學(xué)和表觀遺傳學(xué)等不同水平上解開細(xì)胞群體異質(zhì)性的有力技術(shù)屁柏,因此在基礎(chǔ)研究和臨床轉(zhuǎn)化中具有深遠(yuǎn)的意義怪嫌。細(xì)胞基因型主要通過單細(xì)胞DNA-seq (scDNA-seq)檢測(cè)腫瘤中的體細(xì)胞突變,將細(xì)胞聚類成克隆并推斷其進(jìn)化動(dòng)力學(xué)來研究探遵。最近窟赏,越來越多的證據(jù)表明,在其他單細(xì)胞探針中箱季,包括scATAC-seq和全長(zhǎng)scRNA-seq(例如SMART-seq2)涯穷,在核和線粒體基因組上也可以觀察到體細(xì)胞突變的一個(gè)子集。另一方面藏雏,生殖系變異(又稱單核苷酸多態(tài)性拷况,SNPs)在單細(xì)胞測(cè)序數(shù)據(jù)中被更廣泛地觀察到,即使是在液滴為基礎(chǔ)的平臺(tái)上诉稍,如10XGenomics蝠嘉,這要?dú)w功于龐大的候選列表[在人群中約有700萬個(gè)snp,頻率為0.5 %]杯巨。種系snp不僅是perfect natural barcodes when multiplexing cells from multiple individuals,而且在通過細(xì)胞eQTL分析或等位基因特異性表達(dá)和拷貝數(shù)變化引起的等位基因失衡暗示功能調(diào)控方面也具有重要意義努酸。

工具介紹

Cellsnp-lite是在C/ c++中實(shí)現(xiàn)的服爷,并執(zhí)行每個(gè)細(xì)胞基因分型,supporting both with (mode 1) and without (mode 2) given
SNPs
。在后一種情況下仍源,雜合snp將被自動(dòng)檢測(cè)心褐。Cellsnp-lite適用于基于液滴的(例如10XGenomics數(shù)據(jù))和well-based的平臺(tái)(例如SMART-seq2數(shù)據(jù))

Cellsnp-lite需要以bam/sam/cram文件格式的對(duì)齊讀取作為輸入笼踩。細(xì)胞標(biāo)簽可以在多個(gè)bam文件(基于液滴的平臺(tái))中的細(xì)胞標(biāo)簽中進(jìn)行編碼逗爹,也可以由每個(gè)細(xì)胞bam文件(基于良好的平臺(tái))指定。這種靈活性還允許cellsnp-lite在bulk樣品上無縫工作嚎于,例如bulk RNA-seq掘而,只需將其視為基礎(chǔ)良好的“細(xì)胞”。

pileup是在每個(gè)基因組位置進(jìn)行的于购,對(duì)于給定的snp(模式1)或整個(gè)染色體(即模式2)袍睡。將獲取覆蓋查詢位置的所有讀取。默認(rèn)情況下肋僧,丟棄那些低對(duì)齊質(zhì)量的讀取斑胜,包括MAPQ < 20,對(duì)齊長(zhǎng)度< 30 nt和FLAG與UNMAP, SECONDARY, QCFAIL(和DUP嫌吠,如果UMI不適用)止潘。然后,對(duì)于基于液滴的樣本(模式1a或2a)辫诅,我們通過哈希圖將所有這些讀取分配到每個(gè)細(xì)胞中凭戴,或者對(duì)于well-based的細(xì)胞(模式1b或2b)直接分配。在每個(gè)cell中泥栖,計(jì)算所有A, C, G, T, N堿基的umi(如果存在)或讀取簇宽。如果給定(即模式1),則從輸入snp中取出REF和ALT等位基因吧享,否則選擇REF數(shù)量最高的堿基魏割,ALT數(shù)量次之(模式2)。

當(dāng)給定snp(模式1)時(shí)钢颂,cellsnp-lite將通過將輸入snp按順序等分入多個(gè)線程來執(zhí)行并行計(jì)算钞它。否則,在模式2中殊鞭,cellsnp-lite將通過分裂列出的染色體并行計(jì)算遭垛,每條線程對(duì)應(yīng)一條染色體
在上述所有情況中操灿,cellsnp-lite輸出可選等位基因锯仪、深度(即REF和ALT等位基因)和其他等位基因的稀疏矩陣。如果添加參數(shù)' -genotype '趾盐, cellsnp-lite將使用表1所示的誤差模型進(jìn)行基因分型庶喜,并以VCF格式輸出細(xì)胞作為樣本小腊。

安裝

conda install cellsnp-lite

運(yùn)行

Mode 1: pileup with given SNPs
Mode 1a: droplet-based single cells

Require: a single BAM/SAM/CRAM file, e.g., from CellRanger; a list of cell barcodes, e.g., barcodes.tsv file in the CellRanger directory, outs/filtered_gene_bc_matrices/; a VCF file for common SNPs. This mode is recommended comparing to mode 2, if a list of common SNP is known, e.g., human (see Candidate_SNPs)

cellsnp-lite -s $BAM -b $BARCODE -O $OUT_DIR -R $REGION_VCF -p 20 --minMAF 0.1 --minCOUNT 20 --gzip

As shown in the above command line, we recommend filtering SNPs with <20UMIs or <10% minor alleles for downstream donor deconvolution, by adding --minMAF 0.1 --minCOUNT 20.
Besides, special care needs to be taken when filtering PCR duplicates for scRNA-seq data by including DUP bit in exclFLAG, for the upstream pipeline may mark each extra read sharing the same CB/UMI pair as PCR duplicate, which will result in most variant data being lost. Due to the reason above, cellsnp-lite by default uses a non-DUP exclFLAG value to include PCR duplicates for scRNA-seq data when UMItag is turned on.

Mode 1b: well-based single cells or bulk

Require: one or multiple BAM/SAM/CRAM files (bulk or smart-seq), their according sample ids (optional), and a VCF file for a list of common SNPs. BAM/SAM/CRAM files can be input in comma separated way (-s) or in a list file (-S).

cellsnp-lite -s $BAM1,$BAM2 -I sample_id1,sample_id2 -O $OUT_DIR -R $REGION_VCF -p 20 --cellTAG None --UMItag None --gzip

cellsnp-lite -S $BAM_list_file -i sample_list_file -O $OUT_DIR -R $REGION_VCF -p 20 --cellTAG None --UMItag None --gzip
Mode 2: pileup whole chromosome(s) without given SNPs

For mode2, by default it runs on chr1 to 22 on human. For mouse, you need to specify it to 1,2,…,19 (replace the ellipsis).

This mode may output false positive SNPs, for example somatic variants or falses caused by RNA editing. These false SNPs are probably not consistent in all cells within one individual, hence confounding the demultiplexing. Nevertheless, for species, e.g., zebrafish, without a good list of common SNPs, this strategy is still worth a good try.

Mode 2a: droplet based single cells without given SNPs
# 10x sample with cell barcodes
cellsnp-lite -s $BAM -b $BARCODE -O $OUT_DIR -p 22 --minMAF 0.1 --minCOUNT 100 --gzip

Add --chrom if you only want to genotype specific chromosomes, e.g., 1,2, or chrMT.

Mode 2b: well-based single cells or bulk without SNPs
# a bulk sample without cell barcodes and UMI tag
cellsnp-lite -s $bulkBAM -I Sample0 -O $OUT_DIR -p 22 --minMAF 0.1 --minCOUNT 100 --cellTAG None --UMItag None --gzip

Output

cellsnp-lite outputs at least 5 files listed below (with --gzip):

  • cellSNP.base.vcf.gz: a VCF file listing genotyped SNPs and aggregated AD & DP infomation (without GT).

  • cellSNP.samples.tsv: a TSV file listing cell barcodes or sample IDs.

  • cellSNP.tag.AD.mtx: a file in “Matrix Market exchange formats”, containing the allele depths of the alternative (ALT) alleles.

  • cellSNP.tag.DP.mtx: a file in “Matrix Market exchange formats”, containing the sum of allele depths of the reference and alternative alleles (REF + ALT).

  • cellSNP.tag.OTH.mtx: a file in “Matrix Market exchange formats”, containing the sum of allele depths of all the alleles other than REF and ALT.

If --genotype option was specified, then cellsnp-lite would output the cellSNP.cells.vcf.gz file, a VCF file listing genotyped SNPs and AD & DP & genotype (GT) information for each cell or sample.

Full parameters
Usage:   cellsnp-lite [options]

Options:
  -s, --samFile STR    Indexed sam/bam file(s), comma separated multiple samples.
                       Mode 1a & 2a: one sam/bam file with single cell.
                       Mode 1b & 2b: one or multiple bulk sam/bam files,
                       no barcodes needed, but sample ids and regionsVCF.
  -S, --samFileList FILE   A list file containing bam files, each per line, for Mode 1b & 2b.
  -O, --outDir DIR         Output directory for VCF and sparse matrices.
  -R, --regionsVCF FILE    A vcf file listing all candidate SNPs, for fetch each variants.
                           If None, pileup the genome. Needed for bulk samples.
  -T, --targetsVCF FILE    Similar as -R, but the next position is accessed by streaming rather
                           than indexing/jumping (like -T in samtools/bcftools mpileup).
  -b, --barcodeFile FILE   A plain file listing all effective cell barcode.
  -i, --sampleList FILE    A list file containing sample IDs, each per line.
  -I, --sampleIDs STR      Comma separated sample ids.
  -V, --version            Print software version and exit.
  -h, --help               Show this help message and exit.

Optional arguments:
  --genotype           If use, do genotyping in addition to counting.
  --gzip               If use, the output files will be zipped into BGZF format.
  --printSkipSNPs      If use, the SNPs skipped when loading VCF will be printed.
  -p, --nproc INT      Number of subprocesses [1]
  -f, --refseq FILE    Faidx indexed reference sequence file. If set, the real (genomic)
                       ref extracted from this file would be used for Mode 2 or for the
                       missing REFs in the input VCF for Mode 1.
  --chrom STR          The chromosomes to use, comma separated [1 to 22]
  --cellTAG STR        Tag for cell barcodes, turn off with None [CB]
  --UMItag STR         Tag for UMI: UB, Auto, None. For Auto mode, use UB if barcodes are inputted,
                       otherwise use None. None mode means no UMI but read counts [Auto]
  --minCOUNT INT       Minimum aggragated count [20]
  --minMAF FLOAT       Minimum minor allele frequency [0.00]
  --doubletGL          If use, keep doublet GT likelihood, i.e., GT=0.5 and GT=1.5.

Read filtering:
  --inclFLAG STR|INT   Required flags: skip reads with all mask bits unset []
  --exclFLAG STR|INT   Filter flags: skip reads with any mask bits set [UNMAP,SECONDARY,QCFAIL
                       (when use UMI) or UNMAP,SECONDARY,QCFAIL,DUP (otherwise)]
  --minLEN INT         Minimum mapped length for read filtering [30]
  --minMAPQ INT        Minimum MAPQ for read filtering [20]
  --maxPILEUP INT      Maximum pileup for one site of one file (including those filtered reads),
                       avoids excessive memory usage; 0 means highest possible value [0]
  --maxDEPTH INT       Maximum depth for one site of one file (excluding those filtered reads),
                       avoids excessive memory usage; 0 means highest possible value [0]
  --countORPHAN        If use, do not skip anomalous read pairs.

Note that the "--maxFLAG" option is now deprecated, please use "--inclFLAG" or "--exclFLAG"
instead. You can easily aggregate and convert the flag mask bits to an integer by refering to:
https://broadinstitute.github.io/picard/explain-flags.html

cellsnp-lite

生活很好,有你更好

?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請(qǐng)聯(lián)系作者
禁止轉(zhuǎn)載久窟,如需轉(zhuǎn)載請(qǐng)通過簡(jiǎn)信或評(píng)論聯(lián)系作者秩冈。
  • 序言:七十年代末,一起剝皮案震驚了整個(gè)濱河市斥扛,隨后出現(xiàn)的幾起案子入问,更是在濱河造成了極大的恐慌,老刑警劉巖稀颁,帶你破解...
    沈念sama閱讀 206,013評(píng)論 6 481
  • 序言:濱河連續(xù)發(fā)生了三起死亡事件芬失,死亡現(xiàn)場(chǎng)離奇詭異,居然都是意外死亡峻村,警方通過查閱死者的電腦和手機(jī)麸折,發(fā)現(xiàn)死者居然都...
    沈念sama閱讀 88,205評(píng)論 2 382
  • 文/潘曉璐 我一進(jìn)店門,熙熙樓的掌柜王于貴愁眉苦臉地迎上來粘昨,“玉大人垢啼,你說我怎么就攤上這事≌派觯” “怎么了芭析?”我有些...
    開封第一講書人閱讀 152,370評(píng)論 0 342
  • 文/不壞的土叔 我叫張陵,是天一觀的道長(zhǎng)吞瞪。 經(jīng)常有香客問我馁启,道長(zhǎng),這世上最難降的妖魔是什么芍秆? 我笑而不...
    開封第一講書人閱讀 55,168評(píng)論 1 278
  • 正文 為了忘掉前任惯疙,我火速辦了婚禮,結(jié)果婚禮上妖啥,老公的妹妹穿的比我還像新娘霉颠。我一直安慰自己,他們只是感情好荆虱,可當(dāng)我...
    茶點(diǎn)故事閱讀 64,153評(píng)論 5 371
  • 文/花漫 我一把揭開白布蒿偎。 她就那樣靜靜地躺著,像睡著了一般怀读。 火紅的嫁衣襯著肌膚如雪诉位。 梳的紋絲不亂的頭發(fā)上,一...
    開封第一講書人閱讀 48,954評(píng)論 1 283
  • 那天菜枷,我揣著相機(jī)與錄音苍糠,去河邊找鬼。 笑死啤誊,一個(gè)胖子當(dāng)著我的面吹牛椿息,可吹牛的內(nèi)容都是我干的歹袁。 我是一名探鬼主播坷衍,決...
    沈念sama閱讀 38,271評(píng)論 3 399
  • 文/蒼蘭香墨 我猛地睜開眼寝优,長(zhǎng)吁一口氣:“原來是場(chǎng)噩夢(mèng)啊……” “哼!你這毒婦竟也來了枫耳?” 一聲冷哼從身側(cè)響起乏矾,我...
    開封第一講書人閱讀 36,916評(píng)論 0 259
  • 序言:老撾萬榮一對(duì)情侶失蹤,失蹤者是張志新(化名)和其女友劉穎迁杨,沒想到半個(gè)月后钻心,有當(dāng)?shù)厝嗽跇淞掷锇l(fā)現(xiàn)了一具尸體,經(jīng)...
    沈念sama閱讀 43,382評(píng)論 1 300
  • 正文 獨(dú)居荒郊野嶺守林人離奇死亡铅协,尸身上長(zhǎng)有42處帶血的膿包…… 初始之章·張勛 以下內(nèi)容為張勛視角 年9月15日...
    茶點(diǎn)故事閱讀 35,877評(píng)論 2 323
  • 正文 我和宋清朗相戀三年捷沸,在試婚紗的時(shí)候發(fā)現(xiàn)自己被綠了。 大學(xué)時(shí)的朋友給我發(fā)了我未婚夫和他白月光在一起吃飯的照片狐史。...
    茶點(diǎn)故事閱讀 37,989評(píng)論 1 333
  • 序言:一個(gè)原本活蹦亂跳的男人離奇死亡痒给,死狀恐怖,靈堂內(nèi)的尸體忽然破棺而出骏全,到底是詐尸還是另有隱情苍柏,我是刑警寧澤,帶...
    沈念sama閱讀 33,624評(píng)論 4 322
  • 正文 年R本政府宣布姜贡,位于F島的核電站试吁,受9級(jí)特大地震影響,放射性物質(zhì)發(fā)生泄漏楼咳。R本人自食惡果不足惜熄捍,卻給世界環(huán)境...
    茶點(diǎn)故事閱讀 39,209評(píng)論 3 307
  • 文/蒙蒙 一、第九天 我趴在偏房一處隱蔽的房頂上張望母怜。 院中可真熱鬧余耽,春花似錦、人聲如沸糙申。這莊子的主人今日做“春日...
    開封第一講書人閱讀 30,199評(píng)論 0 19
  • 文/蒼蘭香墨 我抬頭看了看天上的太陽柜裸。三九已至缕陕,卻和暖如春,著一層夾襖步出監(jiān)牢的瞬間疙挺,已是汗流浹背扛邑。 一陣腳步聲響...
    開封第一講書人閱讀 31,418評(píng)論 1 260
  • 我被黑心中介騙來泰國打工, 沒想到剛下飛機(jī)就差點(diǎn)兒被人妖公主榨干…… 1. 我叫王不留铐然,地道東北人蔬崩。 一個(gè)月前我還...
    沈念sama閱讀 45,401評(píng)論 2 352
  • 正文 我出身青樓恶座,卻偏偏與公主長(zhǎng)得像,于是被迫代替她去往敵國和親沥阳。 傳聞我的和親對(duì)象是個(gè)殘疾皇子跨琳,可洞房花燭夜當(dāng)晚...
    茶點(diǎn)故事閱讀 42,700評(píng)論 2 345

推薦閱讀更多精彩內(nèi)容