Samtools使用大全

一检诗、簡介

Samtools是一個用于操作sam和bam格式文件的應(yīng)用程序集合,具有眾多的功能娜扇。 它從SAM(序列比對/映射)格式導(dǎo)入和導(dǎo)出菩混,進行排序脊岳,合并和索引喊儡,并允許快速檢索任何區(qū)域中的讀數(shù)愚臀。SAM和BAM格式的比對文件主要由bwa科雳、bowtie2范删、tophat和hisat2等序列比對工具產(chǎn)生蕾域,用于記錄測序reads在參考基因組上的映射信息。其中到旦,BAM格式文件是SAM文件的 的二進制格式旨巷,占據(jù)內(nèi)存較小且運算速度更快。

二添忘、基本用法

Samtools由一系列的子命令組成采呐,可以對sam和bam格式的文件進行不同的整合與處理。

$ samtools --help

Program: samtools (Tools for alignments in the SAM format)
Version: 1.3.1 (using htslib 1.3.1)

Usage:   samtools <command> [options]

Commands:
  -- Indexing
     dict           create a sequence dictionary file
     faidx          index/extract FASTA
     index          index alignment

  -- Editing
     calmd          recalculate MD/NM tags and '=' bases
     fixmate        fix mate information
     reheader       replace BAM header
     rmdup          remove PCR duplicates
     targetcut      cut fosmid regions (for fosmid pool only)
     addreplacerg   adds or replaces RG tags

  -- File operations
     collate        shuffle and group alignments by name
     cat            concatenate BAMs
     merge          merge sorted alignments
     mpileup        multi-way pileup
     sort           sort alignment file
     split          splits a file by read group
     quickcheck     quickly check if SAM/BAM/CRAM file appears intact
     fastq          converts a BAM to a FASTQ
     fasta          converts a BAM to a FASTA

  -- Statistics
     bedcov         read depth per BED region
     depth          compute the depth
     flagstat       simple stats
     idxstats       BAM index stats
     phase          phase heterozygotes
     stats          generate stats (former bamcheck)

  -- Viewing
     flags          explain BAM flags
     tview          text alignment viewer
     view           SAM<->BAM<->CRAM conversion
     depad          convert padded BAM to unpadded BAM

2.1 構(gòu)建索引(Indexing)

  • dict -- 對fasta格式的參考基因組序列構(gòu)建字典索引

    $ samtools dict
    
    About:   Create a sequence dictionary file from a fasta file
    Usage:   samtools dict [options] <file.fa|file.fa.gz>
    
    Options: -a, --assembly STR    assembly
             -H, --no-header       do not print @HD line
             -o, --output STR      file to write out dict file [stdout]
             -s, --species STR     species
             -u, --uri STR         URI [file:///abs/path/to/file.fa]
    
    • Example:samtools dict -a GRCh38 -s "Homo sapiens" hg38.fasta -o hg38.dict
    • 構(gòu)建完成后會生成一個hg38.dict的參考基因組索引文件
    $ head hg38.dict # 查看索引文件的前十行
    
    @HD     VN:1.0  SO:unsorted
    @SQ     SN:chr1 LN:248956422    M5:2648ae1bacce4ec4b6cf337dcae37816     UR:file:///data/public/refGenome/bwa_index/hg38/hg38.fa AS:GRCh38  SP:Homo sapiens
    @SQ     SN:chr2 LN:242193529    M5:4bb4f82880a14111eb7327169ffb729b     UR:file:///data/public/refGenome/bwa_index/hg38/hg38.fa AS:GRCh38  SP:Homo sapiens
    @SQ     SN:chr3 LN:198295559    M5:a48af509898d3736ba95dc0912c0b461     UR:file:///data/public/refGenome/bwa_index/hg38/hg38.fa AS:GRCh38  SP:Homo sapiens
    @SQ     SN:chr4 LN:190214555    M5:3210fecf1eb92d5489da4346b3fddc6e     UR:file:///data/public/refGenome/bwa_index/hg38/hg38.fa AS:GRCh38  SP:Homo sapiens
    @SQ     SN:chr5 LN:181538259    M5:f7f05fb7ceea78cbc32ce652c540ff2d     UR:file:///data/public/refGenome/bwa_index/hg38/hg38.fa AS:GRCh38  SP:Homo sapiens
    @SQ     SN:chr6 LN:170805979    M5:6a48dfa97e854e3c6f186c8ff973f7dd     UR:file:///data/public/refGenome/bwa_index/hg38/hg38.fa AS:GRCh38  SP:Homo sapiens
    @SQ     SN:chr7 LN:159345973    M5:94eef2b96fd5a7c8db162c8c74378039     UR:file:///data/public/refGenome/bwa_index/hg38/hg38.fa AS:GRCh38  SP:Homo sapiens
    @SQ     SN:chr8 LN:145138636    M5:c67955b5f7815a9a1edfaa15893d3616     UR:file:///data/public/refGenome/bwa_index/hg38/hg38.fa AS:GRCh38  SP:Homo sapiens
    @SQ     SN:chr9 LN:138394717    M5:addd2795560986b7491c40b1faa3978a     UR:file:///data/public/refGenome/bwa_index/hg38/hg38.fa AS:GRCh38  SP:Homo sapiens
    
  • faidx -- 對fasta格式的文件建立索引搁骑,生成的索引文件以.fai后綴結(jié)尾斧吐。

    $ samtools faidx
    
    Usage:   samtools faidx <file.fa|file.fa.gz> [<reg> [...]]
    
    • Example:samtools faidx hg38.fasta

    • 構(gòu)建完成后會生成一個hg38.fasta.fai的索引文件,該文件共有五列仲器,分別表示:

      第一列:序列的名稱

      第二列:序列的長度

      第三列:第一個堿基的偏移量煤率,從0開始計數(shù)

      第四列:除了最后一行外,序列中每行的堿基數(shù)

      第五列:除了最后一行外乏冀,序列中每行的長度(包括換行符)

    $ head hg38.fasta.fai
    
    chr1    248956422       6       50      51
    chr2    242193529       253935563       50      51
    chr3    198295559       500972969       50      51
    chr4    190214555       703234446       50      51
    chr5    181538259       897253299       50      51
    chr6    170805979       1082422330      50      51
    chr7    159345973       1256644435      50      51
    chr8    145138636       1419177334      50      51
    chr9    138394717       1567218749      50      51
    chr10   133797422       1708381368      50      51
    
    • 該命令也可以根據(jù)索引文件蝶糯,快速提取fasta文件中任意區(qū)域的序列文件
    # 提取1號染色體的序列信息
    $ samtools faidx ha38.fasta chr1 > hg38.chr1.fa
    
    # 提取1號染色體200-1000bp之間的序列
    $ samtools faidx hg38.fasta chr1:200-1000 > hg38.chr1_200_1000.fa
    
  • index -- 對bam格式的比對文件構(gòu)建索引,需對bam文件進行排序后才能構(gòu)建索引

    $ samtools index
    
    Usage: samtools index [-bc] [-m INT] <in.bam> [out.index]
    Options:
      -b       Generate BAI-format index for BAM files [default]
      -c       Generate CSI-format index for BAM files
      -m INT   Set minimum interval size for CSI indices to 2^INT [14]
    
    • Example: samtools index aln.sorted.bam

    • 構(gòu)建完成后將產(chǎn)生后綴為.bai的文件辆沦,用于快速的隨機處理

2.2 文件編輯(Editing)

  • calmd -- recalculate MD/NM tags and '=' bases 計算MD tag標簽值( 記錄了mismatch信息 )

    $ samtools calmd
    Usage: samtools calmd [-eubrAES] <aln.bam> <ref.fasta>
    
    Options:
      -e       change identical bases to '='
      -u       uncompressed BAM output (for piping)
      -b       compressed BAM output
      -S       ignored (input format is auto-detected)
      -A       modify the quality string
      -r       compute the BQ tag (without -A) or cap baseQ by BAQ (with -A)
      -E       extended BAQ for better sensitivity but lower specificity
          --input-fmt-option OPT[=VAL]
                   Specify a single input file format option in the form
                   of OPTION or OPTION=VALUE
          --output-fmt FORMAT[,OPT[=VAL]]...
                   Specify output format (SAM, BAM, CRAM)
          --output-fmt-option OPT[=VAL]
                   Specify a single output file format option in the form
                   of OPTION or OPTION=VALUE
          --reference FILE
                   Reference sequence FASTA FILE [null]
    
    • Example:samtools calmd -AEur aln.bam ref.fa
  • fixmate -- fix mate information 為以名稱排序定位的alignment填入配對坐標昼捍,ISIZE(inferred insert size推測的插入序列大小)和配對相應(yīng)的標簽(flag)

    $ samtools fixmate
    
    Usage: samtools fixmate <in.nameSrt.bam> <out.nameSrt.bam>
    Options:
      -r           Remove unmapped reads and secondary alignments
      -p           Disable FR proper pair check
      -c           Add template cigar ct tag
          --input-fmt-option OPT[=VAL]
                   Specify a single input file format option in the form
                   of OPTION or OPTION=VALUE
      -O, --output-fmt FORMAT[,OPT[=VAL]]...
                   Specify output format (SAM, BAM, CRAM)
          --output-fmt-option OPT[=VAL]
                   Specify a single output file format option in the form
                   of OPTION or OPTION=VALUE
          --reference FILE
                   Reference sequence FASTA FILE [null]
    
    As elsewhere in samtools, use '-' as the filename for stdin/stdout. The input
    file must be grouped by read name (e.g. sorted by name). Coordinated sorted
    input is not accepted.
    
    • Example:samtools fixmate in.namesorted.sam out.bam
  • reheader -- replace BAM header 替換bam文件頭注釋信息

    $ samtools reheader
    
    Usage: samtools reheader [-P] in.header.sam in.bam > out.bam
       or  samtools reheader [-P] -i in.header.sam file.bam
    
    Options:
        -P, --no-PG      Do not generate an @PG header line.
        -i, --in-place   Modify the bam/cram file directly.
                         (Defaults to outputting to stdout.)
    
  • rmdup -- remove PCR duplicates 去除bam文件中的PCR重復(fù)信息

    $ samtools rmdup
    
    Usage:  samtools rmdup [-sS] <input.srt.bam> <output.bam>
    
    Option: -s    rmdup for SE reads
            -S    treat PE reads as SE in rmdup (force -s)
          --input-fmt-option OPT[=VAL]
                   Specify a single input file format option in the form
                   of OPTION or OPTION=VALUE
          --output-fmt FORMAT[,OPT[=VAL]]...
                   Specify output format (SAM, BAM, CRAM)
          --output-fmt-option OPT[=VAL]
                   Specify a single output file format option in the form
                   of OPTION or OPTION=VALUE
          --reference FILE
                   Reference sequence FASTA FILE [null]
    
    • Example:samtools rmdup -sS in.sorted.bam output.bam
  • targetcut -- cut fosmid regions (for fosmid pool only) 此命令僅用于從fosmid混池測序中切除fosmid克隆序列

    $ samtools targetcut
    
    Usage: samtools targetcut [-Q minQ] [-i inPen] [-0 em0] [-1 em1] [-2 em2] <in.bam>
          --input-fmt-option OPT[=VAL]
                   Specify a single input file format option in the form
                   of OPTION or OPTION=VALUE
      -f, --reference FILE
                   Reference sequence FASTA FILE [null]
    
  • addreplacerg -- adds or replaces RG tags 該命令用于添加或更改bam文件中的RG tag標簽值

    $ samtools addreplacerg
    
    Usage: samtools addreplacerg [options] [-r <@RG line> | -R <existing id>] [-o <output.bam>] <input.bam>
    
    Options:
      -m MODE   Set the mode of operation from one of overwrite_all, orphan_only [overwrite_all]
      -o FILE   Where to write output to [stdout]
      -r STRING @RG line text
      -R STRING ID of @RG line in existing header to use
          --input-fmt FORMAT[,OPT[=VAL]]...
                   Specify input format (SAM, BAM, CRAM)
          --input-fmt-option OPT[=VAL]
                   Specify a single input file format option in the form
                   of OPTION or OPTION=VALUE
      -O, --output-fmt FORMAT[,OPT[=VAL]]...
                   Specify output format (SAM, BAM, CRAM)
          --output-fmt-option OPT[=VAL]
                   Specify a single output file format option in the form
                   of OPTION or OPTION=VALUE
          --reference FILE
                   Reference sequence FASTA FILE [null]
    
    • Example :samtools addreplacerg -r 'ID:fish' -r 'LB:1334' -r 'SM:alpha' -o output.bam input.bam

2.3 文件處理(File operations)

  • collate -- shuffle and group alignments by name 根據(jù)read name信息將bam文件進行隨機打散和分組

    $ samtools collate
    
    Usage:   samtools collate [-Ou] [-n nFiles] [-c cLevel] <in.bam> <out.prefix>
    
    Options:
          -O       output to stdout
          -u       uncompressed BAM output
          -l INT   compression level [1]
          -n INT   number of temporary files [64]
          --input-fmt-option OPT[=VAL]
                   Specify a single input file format option in the form
                   of OPTION or OPTION=VALUE
          --output-fmt FORMAT[,OPT[=VAL]]...
                   Specify output format (SAM, BAM, CRAM)
          --output-fmt-option OPT[=VAL]
                   Specify a single output file format option in the form
                   of OPTION or OPTION=VALUE
          --reference FILE
                   Reference sequence FASTA FILE [null]
    
    • Example: samtools collate -o aln.name_collated.bam aln.sorted.bam
  • cat -- concatenate BAMs 將多個bam文件合并為一個bam文件(與merge命令的區(qū)別是cat不需要將bam文件提前進行排序)

    $ samtools cat
    
    Usage: samtools cat [-h header.sam] [-o out.bam] <in1.bam> [...]
    
    • Example: samtools cat -o out.bam in1.bam in2.bam
  • merge -- merge sorted alignments 將多個已經(jīng)sort的bam文件進行合并

    $ samtools merge
    
    Usage: samtools merge [-nurlf] [-h inh.sam] [-b <bamlist.fofn>] <out.bam> <in1.bam> [<in2.bam> ... <inN.bam>]
    
    Options:
      -n         Input files are sorted by read name
      -r         Attach RG tag (inferred from file names)
      -u         Uncompressed BAM output
      -f         Overwrite the output BAM if exist
      -1         Compress level 1
      -l INT     Compression level, from 0 to 9 [-1]
      -R STR     Merge file in the specified region STR [all]
      -h FILE    Copy the header in FILE to <out.bam> [in1.bam]
      -c         Combine @RG headers with colliding IDs [alter IDs to be distinct]
      -p         Combine @PG headers with colliding IDs [alter IDs to be distinct]
      -s VALUE   Override random seed
      -b FILE    List of input BAM filenames, one per line [null]
      -@, --threads INT
                 Number of BAM/CRAM compression threads [0]
          --input-fmt-option OPT[=VAL]
                   Specify a single input file format option in the form
                   of OPTION or OPTION=VALUE
      -O, --output-fmt FORMAT[,OPT[=VAL]]...
                   Specify output format (SAM, BAM, CRAM)
          --output-fmt-option OPT[=VAL]
                   Specify a single output file format option in the form
                   of OPTION or OPTION=VALUE
          --reference FILE
                   Reference sequence FASTA FILE [null]
    
    • Example: samtools merge merged.bam in1.sorted.bam in2.sorted.bam
  • mpileup -- multi-way pileup 對參考基因組每個位點做堿基堆積肢扯,用于call SNP和INDEL妒茬。主要是生成BCF、VCF文件或者pileup一個或多個bam文件鹃彻。比對記錄以在@RG中的樣本名作為區(qū)分標識符郊闯。如果樣本標識符缺失,那么每一個輸入文件則視為一個樣本

    $ samtools mpileup
    
    Usage: samtools mpileup [options] in1.bam [in2.bam [...]]
    
    Input options:
      -6, --illumina1.3+      quality is in the Illumina-1.3+ encoding
      -A, --count-orphans     do not discard anomalous read pairs
      -b, --bam-list FILE     list of input BAM filenames, one per line
      -B, --no-BAQ            disable BAQ (per-Base Alignment Quality)
      -C, --adjust-MQ INT     adjust mapping quality; recommended:50, disable:0 [0]
      -d, --max-depth INT     max per-file depth; avoids excessive memory usage [250]
      -E, --redo-BAQ          recalculate BAQ on the fly, ignore existing BQs
      -f, --fasta-ref FILE    faidx indexed reference sequence file
      -G, --exclude-RG FILE   exclude read groups listed in FILE
      -l, --positions FILE    skip unlisted positions (chr pos) or regions (BED)
      -q, --min-MQ INT        skip alignments with mapQ smaller than INT [0]
      -Q, --min-BQ INT        skip bases with baseQ/BAQ smaller than INT [13]
      -r, --region REG        region in which pileup is generated
      -R, --ignore-RG         ignore RG tags (one BAM = one sample)
      --rf, --incl-flags STR|INT  required flags: skip reads with mask bits unset []
      --ff, --excl-flags STR|INT  filter flags: skip reads with mask bits set
                                                [UNMAP,SECONDARY,QCFAIL,DUP]
      -x, --ignore-overlaps   disable read-pair overlap detection
    
    Output options:
      -o, --output FILE       write output to FILE [standard output]
      -g, --BCF               generate genotype likelihoods in BCF format
      -v, --VCF               generate genotype likelihoods in VCF format
    
    Output options for mpileup format (without -g/-v):
      -O, --output-BP         output base positions on reads
      -s, --output-MQ         output mapping quality
    
    Output options for genotype likelihoods (when -g/-v is used):
      -t, --output-tags LIST  optional tags to output:
                   DP,AD,ADF,ADR,SP,INFO/AD,INFO/ADF,INFO/ADR []
      -u, --uncompressed      generate uncompressed VCF/BCF output
    
    SNP/INDEL genotype likelihoods options (effective with -g/-v):
      -e, --ext-prob INT      Phred-scaled gap extension seq error probability [20]
      -F, --gap-frac FLOAT    minimum fraction of gapped reads [0.002]
      -h, --tandem-qual INT   coefficient for homopolymer errors [100]
      -I, --skip-indels       do not perform indel calling
      -L, --max-idepth INT    maximum per-file depth for INDEL calling [250]
      -m, --min-ireads INT    minimum number gapped reads for indel candidates [1]
      -o, --open-prob INT     Phred-scaled gap open seq error probability [40]
      -p, --per-sample-mF     apply -m and -F per-sample for increased sensitivity
      -P, --platforms STR     comma separated list of platforms for indels [all]
          --input-fmt-option OPT[=VAL]
                   Specify a single input file format option in the form
                   of OPTION or OPTION=VALUE
          --reference FILE
                   Reference sequence FASTA FILE [null]
    
    Notes: Assuming diploid individuals.
    
    • Example: samtools mpileup -C 50 -f ref.fasta -r chr3:1,000-2,000 in1.bam in2.bam

    • 主要參數(shù)解讀:

      -A:在檢測變異中蛛株,不忽略異常的reads對

      -C:用于調(diào)節(jié)比對質(zhì)量的系數(shù),如果reads中含有過多的錯配育拨,不能設(shè)置為零

      -D:輸出每個樣本的reads深度

      -l:BED文件或者包含區(qū)域位點的位置列表文件

      注意:位置文件包含兩列谨履,染色體和位置,從1開始計數(shù)熬丧。BED文件至少包含3列笋粟,染色體、起始和終止位置,開始端從0開始計數(shù)害捕。

      -r:在指定區(qū)域產(chǎn)生pileup绿淋,需已建立索引的bam文件,通常和-l參數(shù)一起使用

      -o/g/v:輸出文件類型(標準格式文件或者VCF尝盼、BCF文件)

      -t:設(shè)置FORMAT和INFO的列表內(nèi)容吞滞,以逗號分割

      -u:生成未壓縮的VCF和BCF文件

      -I:跳過INDEL檢測

      -m:候選INDEL的最小間隔的reads

      -f:輸入有索引文件的fasta參考序列

      -F :含有間隔reads的最小片段

    • mpileup生成的結(jié)果包含6列:1) 參考序列名,2) 位置盾沫,3) 參考堿基裁赠,4) 比對上的reads數(shù),5) 比對情況赴精,6) 比對上的堿基的質(zhì)量佩捞。其中第5列比較復(fù)雜,解釋如下:
      a) ‘.’代表與參考序列正鏈匹配蕾哟。
      b) ‘,’代表與參考序列負鏈匹配一忱。
      c) ‘ATCGN’代表在正鏈上的不匹配。
      d) ‘a(chǎn)tcgn’代表在負鏈上的不匹配谭确。
      e) ‘*’代表模糊堿基
      f) ‘’代表匹配的堿基是一個read的開始掀潮;’'后面緊跟的ascii碼減去33代表比對質(zhì)量;這兩個符號修飾的是后面的堿基琼富,其后緊跟的堿基(.,ATCGatcgNn)代表該read的第一個堿基仪吧。
      g) ‘$’代表一個read的結(jié)束,該符號修飾的是其前面的堿基鞠眉。
      h) 正則式’+[0-9]+[ACGTNacgtn]+’代表在該位點后插入的堿基薯鼠;
      i) 正則式’-[0-9]+[ACGTNacgtn]+’代表在該位點后缺失的堿基;

  • sort -- sort alignment file 對比對后的bam文件進行排序械蹋,默認按coordinate進行排序

    $ samtools sort
    
    Usage: samtools sort [options...] [in.bam]
    Options:
      -l INT     Set compression level, from 0 (uncompressed) to 9 (best)
      -m INT     Set maximum memory per thread; suffix K/M/G recognized [768M]
      -n         Sort by read name
      -o FILE    Write final output to FILE rather than standard output
      -T PREFIX  Write temporary files to PREFIX.nnnn.bam
      -@, --threads INT
                 Set number of sorting and compression threads [1]
          --input-fmt-option OPT[=VAL]
                   Specify a single input file format option in the form
                   of OPTION or OPTION=VALUE
      -O, --output-fmt FORMAT[,OPT[=VAL]]...
                   Specify output format (SAM, BAM, CRAM)
          --output-fmt-option OPT[=VAL]
                   Specify a single output file format option in the form
                   of OPTION or OPTION=VALUE
          --reference FILE
                   Reference sequence FASTA FILE [null]
    
    • Example: samtools sort -T /tmp/aln.sorted -o aln.sorted.bam aln.bam

    • 主要參數(shù)釋義:

      -l:設(shè)置文件壓縮等級出皇,0不壓縮,9壓縮最高

      -m:每個線程運行內(nèi)存大谢└辍(可使用K M G表示)

      -n:按照read名稱進行排序

      -o:排序后的輸出文件

      -T:PREFIX臨時文件前綴

      -@:設(shè)置排序和壓縮的線程數(shù)郊艘,默認單線程

  • split -- splits a file by read group 根據(jù)read group標簽將bam文件進行分割

    $ samtools split
    
    Usage: samtools split [-u <unaccounted.bam>[:<unaccounted_header.sam>]]
                          [-f <format_string>] [-v] <merged.bam>
    Options:
      -f STRING       output filename format string ["%*_%#.%."]
      -u FILE1        put reads with no RG tag or an unrecognised RG tag in FILE1
      -u FILE1:FILE2  ...and override the header with FILE2
      -v              verbose output
          --input-fmt-option OPT[=VAL]
                   Specify a single input file format option in the form
                   of OPTION or OPTION=VALUE
          --output-fmt FORMAT[,OPT[=VAL]]...
                   Specify output format (SAM, BAM, CRAM)
          --output-fmt-option OPT[=VAL]
                   Specify a single output file format option in the form
                   of OPTION or OPTION=VALUE
          --reference FILE
                   Reference sequence FASTA FILE [null]
    
    Format string expansions:
      %%     %
      %*     basename
      %#     @RG index
      %!     @RG ID
      %.     filename extension for output format
    
    • Example: samtools split merged.bam
  • quickcheck -- quickly check if SAM/BAM/CRAM file appears intact 檢查SAM/BAM/CRAM文件的完整性

    $ samtools quickcheck
    
    Usage: samtools quickcheck [options] <input> [...]
    Options:
      -v              verbose output (repeat for more verbosity)
    
    Notes:
    
    1. In order to use this command effectively, you should check its exit status;
       without any -v options it will NOT print any output, even when some files
       fail the check. One way to use quickcheck might be as a check that all
       BAM files in a directory are okay:
    
            samtools quickcheck *.bam && echo 'all ok' \
               || echo 'fail!'
    
       To also determine which files have failed, use the -v option:
    
            samtools quickcheck -v *.bam > bad_bams.fofn \
               && echo 'all ok' \
               || echo 'some files failed check, see bad_bams.fofn'
    
    • Example: samtools quickcheck in1.bam in2.cram
  • fastq -- converts a BAM to a FASTQ 將bam格式文件轉(zhuǎn)換為fastq文件

    $ samtools fastq
    
    Usage: samtools fastq [options...] <in.bam>
    Options:
      -0 FILE   write paired reads flagged both or neither READ1 and READ2 to FILE
      -1 FILE   write paired reads flagged READ1 to FILE
      -2 FILE   write paired reads flagged READ2 to FILE
      -f INT    only include reads with all bits set in INT set in FLAG [0]
      -F INT    only include reads with none of the bits set in INT set in FLAG [0]
      -n        don't append /1 and /2 to the read name
      -O        output quality in the OQ tag if present
      -s FILE   write singleton reads to FILE [assume single-end]
      -t        copy RG, BC and QT tags to the FASTQ header line
      -v INT    default quality score if not given in file [1]
          --input-fmt-option OPT[=VAL]
                   Specify a single input file format option in the form
                   of OPTION or OPTION=VALUE
          --reference FILE
                   Reference sequence FASTA FILE [null]
    
    • Example: samtools fastq input.bam > output.fastq
  • fasta -- converts a BAM to a FASTA 將bam格式文件轉(zhuǎn)換為fasta文件

    $ samtools fasta
    
    Usage: samtools fasta [options...] <in.bam>
    Options:
      -0 FILE   write paired reads flagged both or neither READ1 and READ2 to FILE
      -1 FILE   write paired reads flagged READ1 to FILE
      -2 FILE   write paired reads flagged READ2 to FILE
      -f INT    only include reads with all bits set in INT set in FLAG [0]
      -F INT    only include reads with none of the bits set in INT set in FLAG [0]
      -n        don't append /1 and /2 to the read name
      -s FILE   write singleton reads to FILE [assume single-end]
      -t        copy RG, BC and QT tags to the FASTA header line
          --input-fmt-option OPT[=VAL]
                   Specify a single input file format option in the form
                   of OPTION or OPTION=VALUE
          --reference FILE
                   Reference sequence FASTA FILE [null]
    
    • Example: samtools fasta input.bam > output.fasta

2.4 數(shù)據(jù)統(tǒng)計(Statistics)

  • bedcov -- read depth per BED region 統(tǒng)計給定region區(qū)間內(nèi)reads的深度,每個堿基的深度疊加在一起

    $ samtools bedcov
    
    Usage: samtools bedcov [options] <in.bed> <in1.bam> [...]
    
      -Q INT       Only count bases of at least INT quality [0]
          --input-fmt-option OPT[=VAL]
                   Specify a single input file format option in the form
                   of OPTION or OPTION=VALUE
          --reference FILE
                   Reference sequence FASTA FILE [null]
    
    • Example: samtools bedcov in.bed aln.sorted.bam
  • depth -- compute the depth 統(tǒng)計每個堿基位點的測序深度唯咬,注意計算前必須對bam文件排序并構(gòu)建索引

    $ samtools depth
    
    Usage: samtools depth [options] in1.bam [in2.bam [...]]
    Options:
       -a                  output all positions (including zero depth)
       -a -a (or -aa)      output absolutely all positions, including unused ref. sequences
       -b <bed>            list of positions or regions
       -f <list>           list of input BAM filenames, one per line [null]
       -l <int>            read length threshold (ignore reads shorter than <int>)
       -d/-m <int>         maximum coverage depth [8000]
       -q <int>            base quality threshold
       -Q <int>            mapping quality threshold
       -r <chr:from-to>    region
          --input-fmt-option OPT[=VAL]
                   Specify a single input file format option in the form
                   of OPTION or OPTION=VALUE
          --reference FILE
                   Reference sequence FASTA FILE [null]
    
    The output is a simple tab-separated table with three columns: reference name,
    position, and coverage depth.  Note that positions with zero coverage may be
    omitted by default; see the -a option.
    
    • Example: samtools depth aln.sorted.bam
  • flagstat -- simple stats 統(tǒng)計bam文件中read的比對情況

    $ samtools flagstat
    
    Usage: samtools flagstat [--input-fmt-option OPT=VAL] <in.bam>
    
    • Example: samtools flagstat aln.sorted.bam
  • idxstats -- BAM index stats 統(tǒng)計bam索引文件里的比對信息

    $ samtools idxstats
    
    Usage: samtools idxstats <in.bam>
    
    • Example: samtools idxstats aln.sorted.bam
    • idxstats生成一個統(tǒng)計表格纱注,共有4列,分別為”序列名胆胰,序列長度狞贱,比對上的reads數(shù),unmapped reads number”蜀涨。第4列應(yīng)該是paired reads中有一端能匹配到該scaffold上瞎嬉,而另外一端不匹配到任何scaffolds上的reads數(shù)蝎毡。
  • phase -- Call and phase heterozygous SNPs

    $ samtools phase
    
    Usage:   samtools phase [options] <in.bam>
    
    Options: -k INT    block length [13]
             -b STR    prefix of BAMs to output [null]
             -q INT    min het phred-LOD [37]
             -Q INT    min base quality in het calling [13]
             -D INT    max read depth [256]
             -F        do not attempt to fix chimeras
             -A        drop reads with ambiguous phase
    
          --input-fmt-option OPT[=VAL]
                   Specify a single input file format option in the form
                   of OPTION or OPTION=VALUE
          --output-fmt FORMAT[,OPT[=VAL]]...
                   Specify output format (SAM, BAM, CRAM)
          --output-fmt-option OPT[=VAL]
                   Specify a single output file format option in the form
                   of OPTION or OPTION=VALUE
          --reference FILE
                   Reference sequence FASTA FILE [null]
    
    • Example: samtools phase test.bam
  • stats -- generate stats (former bamcheck) 對bam文件做詳細統(tǒng)計, 統(tǒng)計結(jié)果可用mics/plot-bamstats作圖

    $ samtools stats
    
    About: The program collects statistics from BAM files. The output can be visualized using plot-bamstats.
    Usage: samtools stats [OPTIONS] file.bam
           samtools stats [OPTIONS] file.bam chr:from-to
    Options:
        -c, --coverage <int>,<int>,<int>    Coverage distribution min,max,step [1,1000,1]
        -d, --remove-dups                   Exclude from statistics reads marked as duplicates
        -f, --required-flag  <str|int>      Required flag, 0 for unset. See also `samtools flags` [0]
        -F, --filtering-flag <str|int>      Filtering flag, 0 for unset. See also `samtools flags` [0]
            --GC-depth <float>              the size of GC-depth bins (decreasing bin size increases memory requirement) [2e4]
        -h, --help                          This help message
        -i, --insert-size <int>             Maximum insert size [8000]
        -I, --id <string>                   Include only listed read group or sample name
        -l, --read-length <int>             Include in the statistics only reads with the given read length []
        -m, --most-inserts <float>          Report only the main part of inserts [0.99]
        -P, --split-prefix <str>            Path or string prefix for filepaths output by -S (default is input filename)
        -q, --trim-quality <int>            The BWA trimming parameter [0]
        -r, --ref-seq <file>                Reference sequence (required for GC-depth and mismatches-per-cycle calculation).
        -s, --sam                           Ignored (input format is auto-detected).
        -S, --split <tag>                   Also write statistics to separate files split by tagged field.
        -t, --target-regions <file>         Do stats in these regions only. Tab-delimited file chr,from,to, 1-based, inclusive.
        -x, --sparse                        Suppress outputting IS rows where there are no insertions.
          --input-fmt-option OPT[=VAL]
                   Specify a single input file format option in the form
                   of OPTION or OPTION=VALUE
          --reference FILE
                   Reference sequence FASTA FILE [null]
    
    • Example: samtools stats aln.sorted.bam
    • 輸出的信息比較多,部分如下:
      Summary Numbers氧枣,raw total sequences沐兵,filtered sequences, reads mapped, reads mapped and paired,reads properly paired等信息
      Fragment Qualitites:根據(jù)cycle統(tǒng)計每個位點上的堿基質(zhì)量分布
      Coverage distribution:深度為1,2便监,3扎谎,,茬贵,的堿基數(shù)目
      ACGT content per cycle:ACGT在每個cycle中的比例
      Insert sizes:插入長度的統(tǒng)計
      Read lengths:read的長度分布

2.5 可視化(Viewing)

  • flags -- explain BAM flags 查看不同flag值的含義

    $ samtools flags
    
    About: Convert between textual and numeric flag representation
    Usage: samtools flags INT|STR[,...]
    
    Flags:
            0x1     PAIRED        .. paired-end (or multiple-segment) sequencing technology
            0x2     PROPER_PAIR   .. each segment properly aligned according to the aligner
            0x4     UNMAP         .. segment unmapped
            0x8     MUNMAP        .. next segment in the template unmapped
            0x10    REVERSE       .. SEQ is reverse complemented
            0x20    MREVERSE      .. SEQ of the next segment in the template is reversed
            0x40    READ1         .. the first segment in the template
            0x80    READ2         .. the last segment in the template
            0x100   SECONDARY     .. secondary alignment
            0x200   QCFAIL        .. not passing quality controls
            0x400   DUP           .. PCR or optical duplicate
            0x800   SUPPLEMENTARY .. supplementary alignment
    
    • Example: samtools flags PAIRED,UNMAP,MUNMAP

    • FLAGS:

      0x1 PAIRED paired-end (or multiple-segment) sequencing technology
      0x2 PROPER_PAIR each segment properly aligned according to the aligner
      0x4 UNMAP segment unmapped
      0x8 MUNMAP next segment in the template unmapped
      0x10 REVERSE SEQ is reverse complemented
      0x20 MREVERSE SEQ of the next segment in the template is reverse complemented
      0x40 READ1 the first segment in the template
      0x80 READ2 the last segment in the template
      0x100 SECONDARY secondary alignment
      0x200 QCFAIL not passing quality controls
      0x400 DUP PCR or optical duplicate
      0x800 SUPPLEMENTARY supplementary alignment
  • tview -- text alignment viewer 直觀地顯示reads比對到參考基因組的情況簿透,類似于基因組瀏覽器

    $ samtools tview
    
    Usage: samtools tview [options] <aln.bam> [ref.fasta]
    Options:
       -d display      output as (H)tml or (C)urses or (T)ext
       -p chr:pos      go directly to this position
       -s STR          display only reads from this sample or group
          --input-fmt-option OPT[=VAL]
                   Specify a single input file format option in the form
                   of OPTION or OPTION=VALUE
          --reference FILE
                   Reference sequence FASTA FILE [null]
    
    • Example: samtools tview aln.sorted.bam ref.fasta
    • 需先對bam文件進行排序并構(gòu)建索引
  • view -- SAM<->BAM<->CRAM conversion 將sam和bam文件進行格式互換

    $ samtools view
    
    Usage: samtools view [options] <in.bam>|<in.sam>|<in.cram> [region ...]
    
    Options:
      -b       output BAM
      -C       output CRAM (requires -T)
      -1       use fast BAM compression (implies -b)
      -u       uncompressed BAM output (implies -b)
      -h       include header in SAM output
      -H       print SAM header only (no alignments)
      -c       print only the count of matching records
      -o FILE  output file name [stdout]
      -U FILE  output reads not selected by filters to FILE [null]
      -t FILE  FILE listing reference names and lengths (see long help) [null]
      -L FILE  only include reads overlapping this BED FILE [null]
      -r STR   only include reads in read group STR [null]
      -R FILE  only include reads with read group listed in FILE [null]
      -q INT   only include reads with mapping quality >= INT [0]
      -l STR   only include reads in library STR [null]
      -m INT   only include reads with number of CIGAR operations consuming
               query sequence >= INT [0]
      -f INT   only include reads with all bits set in INT set in FLAG [0]
      -F INT   only include reads with none of the bits set in INT set in FLAG [0]
      -x STR   read tag to strip (repeatable) [null]
      -B       collapse the backward CIGAR operation
      -s FLOAT integer part sets seed of random number generator [0];
               rest sets fraction of templates to subsample [no subsampling]
      -@, --threads INT
               number of BAM/CRAM compression threads [0]
      -?       print long help, including note about region specification
      -S       ignored (input format is auto-detected)
          --input-fmt-option OPT[=VAL]
                   Specify a single input file format option in the form
                   of OPTION or OPTION=VALUE
      -O, --output-fmt FORMAT[,OPT[=VAL]]...
                   Specify output format (SAM, BAM, CRAM)
          --output-fmt-option OPT[=VAL]
                   Specify a single output file format option in the form
                   of OPTION or OPTION=VALUE
      -T, --reference FILE
                   Reference sequence FASTA FILE [null]
    
    • Example :samtools view -bS test.sam > test.bam

    • 重要參數(shù)解讀:

      -b:輸出bam格式

      -C:輸出CRAM文件

      -1:快速壓縮(需要-b)

      -u:輸出未壓縮的bam文件,節(jié)約時間解藻,占據(jù)較多磁盤空間(需要-b)

      -h:默認輸出sam文件不帶表頭老充,該參數(shù)設(shè)定后輸出帶表頭信息

      -H:僅僅輸出表頭信息

      -c:僅打印匹配數(shù)

      -o:輸出文件(stdout標準輸出)

      -U:輸出沒有經(jīng)過過濾選擇的reads

      -t:制表分隔符文件(需要提供額外的參考數(shù)據(jù),比如參考基因組的索引)

      -L:僅包括和bed文件有重疊的reads

      -r:僅輸出在STR讀段組中的reads

      -R:僅輸出特定reads

      -q:定位的質(zhì)量大于INT[默認0]

      -l:僅輸出在STR 庫中的reads

      -F:獲得比對上(mapped)的過濾設(shè)置[默認0]

      -f:獲得未必對上(unmapped)的過濾設(shè)置[默認0]

      -T:使用fasta格式的參考序列

  • depad -- convert padded BAM to unpadded BAM

    $ samtools depad
    
    Usage:   samtools depad <in.bam>
    
    Options:
      -s           Output is SAM (default is BAM)
      -S           Input is SAM (default is BAM)
      -u           Uncompressed BAM output (can't use with -s)
      -1           Fast compression BAM output (can't use with -s)
      -T, --reference FILE
                   Padded reference sequence file [null]
      -o FILE      Output file name [stdout]
      -?           Longer help
          --input-fmt-option OPT[=VAL]
                   Specify a single input file format option in the form
                   of OPTION or OPTION=VALUE
          --output-fmt FORMAT[,OPT[=VAL]]...
                   Specify output format (SAM, BAM, CRAM)
          --output-fmt-option OPT[=VAL]
                   Specify a single output file format option in the form
                   of OPTION or OPTION=VALUE
    
    • Example: samtools depad input.bam

三螟左、常用示例

  1. sam/bam格式文件互換
$ samtools view -bS -1 test.sam > test.bam # sam轉(zhuǎn)bam
$ samtools view -h test.bam > test.sam # bam轉(zhuǎn)sam
  1. 提取比對到參考基因組上的數(shù)據(jù)
$ samtools view -bF 4 test.bam > test.F.bam
  1. 提取沒有比對到參考基因組上的數(shù)據(jù)
$ samtools view -bf 4 test.bam > test.f.bam
  1. 雙端reads都比對到參考基因組上的數(shù)據(jù)
$ samtools view -bF 12 test.bam > test.12.bam
  1. 單端reads1比對到參考基因組上的數(shù)據(jù)
samtools view -bF 4 -f 8  test .bam > test1.bam
  1. 單端reads2比對到參考基因組上的數(shù)據(jù)
$ samtools view -bF 8 -f 4 test.bam > test2.bam
  1. 提取bam文件中比對到scaffold1上的比對結(jié)果啡浊,并保存到sam文件格式
$ samtools view abc.bam scaffold1 > scaffold1.sam
  1. 提取scaffold1上能比對到30k到100k區(qū)域的比對結(jié)果
$ samtools view abc.bam scaffold1:30000-100000 > scaffold1_30k-100k.sam
  1. 根據(jù)fasta文件,將 header 加入到 sam 或 bam 文件中
$ samtools view -T genome.fasta -h scaffold1.sam > scaffold1.h.sam
  1. call SNP和INDEL等變異信息
$ samtools mpileup -f genome.fasta abc.bam > abc.txt
$ samtools mpileup -gSDf genome.fasta abc.bam > abc.bcf
$ samtools mpileup -guSDf genome.fasta abc.bam | \
           bcftools view -cvNg - > abc.vcf

參考來源:
http://www.htslib.org/doc/samtools.html
https://www.cnblogs.com/xiaofeiIDO/p/6805373.html
https://blog.csdn.net/sunchengquan/article/details/85176940

最后編輯于
?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請聯(lián)系作者
  • 序言:七十年代末胶背,一起剝皮案震驚了整個濱河市巷嚣,隨后出現(xiàn)的幾起案子,更是在濱河造成了極大的恐慌钳吟,老刑警劉巖廷粒,帶你破解...
    沈念sama閱讀 216,372評論 6 498
  • 序言:濱河連續(xù)發(fā)生了三起死亡事件,死亡現(xiàn)場離奇詭異红且,居然都是意外死亡坝茎,警方通過查閱死者的電腦和手機,發(fā)現(xiàn)死者居然都...
    沈念sama閱讀 92,368評論 3 392
  • 文/潘曉璐 我一進店門暇番,熙熙樓的掌柜王于貴愁眉苦臉地迎上來嗤放,“玉大人,你說我怎么就攤上這事壁酬〈巫茫” “怎么了?”我有些...
    開封第一講書人閱讀 162,415評論 0 353
  • 文/不壞的土叔 我叫張陵舆乔,是天一觀的道長岳服。 經(jīng)常有香客問我,道長蜕煌,這世上最難降的妖魔是什么派阱? 我笑而不...
    開封第一講書人閱讀 58,157評論 1 292
  • 正文 為了忘掉前任,我火速辦了婚禮斜纪,結(jié)果婚禮上贫母,老公的妹妹穿的比我還像新娘。我一直安慰自己盒刚,他們只是感情好腺劣,可當我...
    茶點故事閱讀 67,171評論 6 388
  • 文/花漫 我一把揭開白布。 她就那樣靜靜地躺著因块,像睡著了一般橘原。 火紅的嫁衣襯著肌膚如雪。 梳的紋絲不亂的頭發(fā)上涡上,一...
    開封第一講書人閱讀 51,125評論 1 297
  • 那天趾断,我揣著相機與錄音,去河邊找鬼吩愧。 笑死芋酌,一個胖子當著我的面吹牛,可吹牛的內(nèi)容都是我干的雁佳。 我是一名探鬼主播脐帝,決...
    沈念sama閱讀 40,028評論 3 417
  • 文/蒼蘭香墨 我猛地睜開眼,長吁一口氣:“原來是場噩夢啊……” “哼糖权!你這毒婦竟也來了堵腹?” 一聲冷哼從身側(cè)響起,我...
    開封第一講書人閱讀 38,887評論 0 274
  • 序言:老撾萬榮一對情侶失蹤星澳,失蹤者是張志新(化名)和其女友劉穎疚顷,沒想到半個月后,有當?shù)厝嗽跇淞掷锇l(fā)現(xiàn)了一具尸體禁偎,經(jīng)...
    沈念sama閱讀 45,310評論 1 310
  • 正文 獨居荒郊野嶺守林人離奇死亡腿堤,尸身上長有42處帶血的膿包…… 初始之章·張勛 以下內(nèi)容為張勛視角 年9月15日...
    茶點故事閱讀 37,533評論 2 332
  • 正文 我和宋清朗相戀三年,在試婚紗的時候發(fā)現(xiàn)自己被綠了届垫。 大學(xué)時的朋友給我發(fā)了我未婚夫和他白月光在一起吃飯的照片释液。...
    茶點故事閱讀 39,690評論 1 348
  • 序言:一個原本活蹦亂跳的男人離奇死亡,死狀恐怖装处,靈堂內(nèi)的尸體忽然破棺而出误债,到底是詐尸還是另有隱情,我是刑警寧澤妄迁,帶...
    沈念sama閱讀 35,411評論 5 343
  • 正文 年R本政府宣布寝蹈,位于F島的核電站,受9級特大地震影響登淘,放射性物質(zhì)發(fā)生泄漏箫老。R本人自食惡果不足惜,卻給世界環(huán)境...
    茶點故事閱讀 41,004評論 3 325
  • 文/蒙蒙 一黔州、第九天 我趴在偏房一處隱蔽的房頂上張望耍鬓。 院中可真熱鬧阔籽,春花似錦、人聲如沸牲蜀。這莊子的主人今日做“春日...
    開封第一講書人閱讀 31,659評論 0 22
  • 文/蒼蘭香墨 我抬頭看了看天上的太陽涣达。三九已至在辆,卻和暖如春,著一層夾襖步出監(jiān)牢的瞬間度苔,已是汗流浹背匆篓。 一陣腳步聲響...
    開封第一講書人閱讀 32,812評論 1 268
  • 我被黑心中介騙來泰國打工, 沒想到剛下飛機就差點兒被人妖公主榨干…… 1. 我叫王不留寇窑,地道東北人鸦概。 一個月前我還...
    沈念sama閱讀 47,693評論 2 368
  • 正文 我出身青樓,卻偏偏與公主長得像疗认,于是被迫代替她去往敵國和親完残。 傳聞我的和親對象是個殘疾皇子,可洞房花燭夜當晚...
    茶點故事閱讀 44,577評論 2 353