mpileup是samtools的一個命令,用來生存bcf文件污尉,然后再用bcftools進行SNP和Indel的分析。另外蛆封,bcftools是samtools的附帶軟件。
1用法
samtools mpileup [options] in1.bam [in2.bam [...]]
Input options:
-6, --illumina1.3+ quality is in the Illumina-1.3+ encoding
-A, --count-orphans do not discard anomalous read pairs
-b, --bam-list FILE list of input BAM filenames, one per line
-B, --no-BAQ disable BAQ (per-Base Alignment Quality)
-C, --adjust-MQ INT adjust mapping quality; recommended:50, disable:0 [0]
-d, --max-depth INT max per-file depth; avoids excessive memory usage [8000]
-E, --redo-BAQ recalculate BAQ on the fly, ignore existing BQs
-f, --fasta-ref FILE faidx indexed reference sequence file
-G, --exclude-RG FILE exclude read groups listed in FILE
-l, --positions FILE skip unlisted positions (chr pos) or regions (BED)
-q, --min-MQ INT skip alignments with mapQ smaller than INT [0]
-Q, --min-BQ INT skip bases with baseQ/BAQ smaller than INT [13]
-r, --region REG region in which pileup is generated
-R, --ignore-RG ignore RG tags (one BAM = one sample)
--rf, --incl-flags STR|INT required flags: skip reads with mask bits unset []
--ff, --excl-flags STR|INT filter flags: skip reads with mask bits set
[UNMAP,SECONDARY,QCFAIL,DUP]
-x, --ignore-overlaps disable read-pair overlap detection
Output options:
-o, --output FILE write output to FILE [standard output]
-O, --output-BP output base positions on reads
-s, --output-MQ output mapping quality
--output-QNAME output read names
-a output all positions (including zero depth)
-a -a (or -aa) output absolutely all positions, including unused ref. sequences
最常用的是
最常用的參數(shù)有兩個:
-f
用samtools faidx對參考序列建index.fai
文件歇盼,其他軟件也可以
-g
輸出到bcf格道宅,否則生成文本格式文件萧豆。用法和最簡單的例子如下
u
輸出不壓縮的bcf文件
$ samtools mpileup -f genome.fasta abc.bam > abc.txt
$ samtools mpileup -gSDf genome.fasta abc.bam > abc.bcf
$ samtools mpileup -guSDf genome.fasta abc.bam |bcftools view -cvNg - > abc.vcf
2結(jié)果解釋
2.1結(jié)果共6列
[mpileup] 1 samples in 1 input files
chr1 10105 N 8 AAAAcAAA kuuu>6uK
chr1 10106 N 10 CCCCcCCCC^$c uuuukAuuAK
chr1 10107 N 10 CCCCcCCCCc upukaFfuKF
chr1 10108 N 9 CCCCcCCCC ukup7K\uK
chr1 10109 N 10 AA$AAaAAAA$a kpuuBKppFA
chr1 10110 N 7 AA$A$aAAA kupkFkp
chr1 10111 N 6 CcCCCc QuApuK
chr1 10112 N 6 CcCCCc \uKppK
chr1 10113 N 6 C$cCCCc \uKkpK
chr1 10114 N 5 tTTTt k7fpF
分別是
- 參考序列名(染色體)
- 位置
- 參考序列的堿基
- 比對上的read數(shù)目
- 比對的情況
- 比對上的堿基的質(zhì)量
2.2其中第五列相對復雜臼朗,具體解釋如下:
chr1 10056 N 7 AAAA*AA kfuufKK
chr1 2003928 N 5 CCC+6GGGCCGC+6GGGCCGC uupu<
chr1 10106 N 10 CCCCcCCCC^$c uuuukAuuAK
chr1 737247 N 3 Cc^5c KKK
chr1 1197931 N 3 T-6NNNNNNT-6NNNNNNt-6nnnnnn uuK
- 1
*
表示模糊堿基 - 2 大寫表示在正鏈不匹配
- 3 小寫表示在負鏈不匹配
- 4
^
表示匹配的堿基是一個reads的開始邻寿,^
后緊跟的ascii碼減去33代表比對質(zhì)量,修飾的是后面的堿基依溯,后面緊跟的堿基代表該read的第一個堿基 - 5
$
代表一個read的結(jié)束,該符號修飾前面的堿基 - 6 正則表達式式
+[0-9]+[ACGTNacgtn]+
代表在該位點后插入的堿基瘟则。舉例中chr1的2003928A后面有個+6GGGCCG
黎炉,很可能是indel - 7 正則表達式’-[0-9]+[ACGTNacgtn]+’代表在該位點后缺失的堿基;