bwa 使用指南

bwa - Burrows-Wheeler Alignment Tool

#1. 介紹

BWA 是一個(gè)能將差異較小的測(cè)序數(shù)據(jù)比對(duì)到參考基因組的工具忿危。它包含三種算法:BWA-backtrack, BWA-SW 和 BWA-MEM.

  • BWA-backtrack:適用于Illumina 數(shù)據(jù),最長(zhǎng)100bp; aln/samse/sampe
  • BWA-SW:長(zhǎng)序列耍休,70bp-1Mbp纯赎;bwasw
  • BWA-MEM:長(zhǎng)序列趟脂,70bp-1Mbp司志;mem

注:但是對(duì)于高質(zhì)量的數(shù)據(jù)冰抢,通常推薦使用最新的BWA-MEM松嘶,因?yàn)樗臁⒏鼫?zhǔn)確挎扰。對(duì)于70-100bp Illumina讀數(shù)翠订,BWA-MEM也比BWA-backtrack有更好的性能。

對(duì)于所有的算法遵倦,BWA首先需要為參考基因組構(gòu)建fm-index(index 命令)尽超。使用不同的子命令調(diào)用比對(duì)算法: BWA-backtrack可調(diào)用aln/samse/sampe,BWA-SW可調(diào)用bwasw梧躺,對(duì)于BWA-MEM算法調(diào)用mem似谁。

#2. 安裝

##2.1 github安裝

git clone https://github.com/lh3/bwa.git
cd bwa; make
./bwa index ref.fa
./bwa mem ref.fa read-se.fq.gz | gzip -3 > aln-se.sam.gz
./bwa mem ref.fa read1.fq read2.fq | gzip -3 > aln-pe.sam.gz

##2.2 自定義安裝

bunzip2 bwa-0.5.9.tar.bz2 
tar xvf bwa-0.5.9.tar
cd bwa-0.5.9
make
  • 添加到環(huán)境路徑
export PATH=$PATH:/path/to/bwa-0.5.9 
source ~/.bashrc
  • 測(cè)試
$ bwa

Program: bwa (alignment via Burrows-Wheeler transformation)
Version: 0.7.5a-r405
Contact: Heng Li <lh3@sanger.ac.uk>

Usage:   bwa <command> [options]

Command: index         index sequences in the FASTA format
         mem           BWA-MEM algorithm
         fastmap       identify super-maximal exact matches
         pemerge       merge overlapping paired ends (EXPERIMENTAL)
         aln           gapped/ungapped alignment
         samse         generate alignment (single ended)
         sampe         generate alignment (paired ended)
         bwasw         BWA-SW for long queries

         fa2pac        convert FASTA to PAC format
         pac2bwt       generate BWT from PAC
         pac2bwtgen    alternative algorithm for generating BWT
         bwtupdate     update .bwt to the new format
         bwt2sa        generate SA from BWT and Occ

Note: To use BWA, you need to first index the genome with `bwa index'. There are
      three alignment algorithms in BWA: `mem', `bwasw' and `aln/samse/sampe'. If
      you are not sure which to use, try `bwa mem' first. Please `man ./bwa.1' for
      for the manual.

#3. 使用示例

  • 創(chuàng)建參考索引
bwa index -p hg19bwaidx -a bwtsw wg.fa
  • 比對(duì):使用4個(gè)CPU,reads在s_3_sequence.txt.gz文件
bwa aln -t 4 hg19bwaidx s_3_sequence.txt.gz >  s_3_sequence.txt.bwa

注:BWA 可以讀取壓縮文件;當(dāng)超過(guò)10個(gè)cpu比對(duì)時(shí)巩踏,觀察到SAM輸出有問(wèn)題秃诵。

  • 將格式轉(zhuǎn)換為sam
bwa samse hg19bwaidx s_3_sequence.txt.bwa s_3_sequence.txt.gz > s_3_sequence.txt.sam
  • Mapping short reads to RefSeq mRNAs
bwa samse RefSeqbwaidx s_3_sequence.txt.bwa s_3_sequence.txt > s_3_sequence.txt.sam
  • Mapping long reads (454)
bwa bwasw hg19bwaidx 454seqs.txt > 454seqs.sam
#Illumina/454/IonTorrent single-end reads longer than ~70bp or assembly contigs up to a few megabases mapped to a closely related reference genome:
  bwa mem ref.fa reads.fq > aln.sam

#Illumina single-end reads shorter than ~70bp:
  bwa aln ref.fa reads.fq > reads.sai; bwa samse ref.fa reads.sai reads.fq > aln-se.sam

#Illumina/454/IonTorrent paired-end reads longer than ~70bp:
  bwa mem ref.fa read1.fq read2.fq > aln-pe.sam

#Illumina paired-end reads shorter than ~70bp:
  bwa aln ref.fa read1.fq > read1.sai; bwa aln ref.fa read2.fq > read2.sai
  bwa sampe ref.fa read1.sai read2.sai read1.fq read2.fq > aln-pe.sam

#PacBio subreads or Oxford Nanopore reads to a reference genome:
  bwa mem -x pacbio ref.fa reads.fq > aln.sam
  bwa mem -x ont2d ref.fa reads.fq > aln.sam

#4. 常用命令行

bwa index ref.fa
bwa mem ref.fa reads.fq > aln-se.sam
bwa mem ref.fa read1.fq read2.fq > aln-pe.sam
bwa aln ref.fa short_read.fq > aln_sa.sai
bwa samse ref.fa aln_sa.sai short_read.fq > aln-se.sam
bwa sampe ref.fa aln_sa1.sai aln_sa2.sai read1.fq read2.fq > aln-pe.sam
bwa bwasw ref.fa long_read.fq > aln.sam

#5. 命令及參數(shù)

##5.1 index

  • FASTA格式的數(shù)據(jù)建立索引.

  • bwa index [-p prefix] [-a algoType] <in.db.fasta>

參數(shù)
-p STR 輸出前綴
-a STR BWT索引構(gòu)建算法 (is和bwtsw)

    • is 需要5.37N的內(nèi)存,其中N是數(shù)據(jù)庫(kù)的大小塞琼。IS的速度比較快菠净,但是不能用于大于2GB的數(shù)據(jù)庫(kù)。由于其簡(jiǎn)單性屈梁,IS是默認(rèn)算法嗤练。較小的參考基因組。
    • bwtsw 算法內(nèi)置于BWT-SW在讶。這種方法適用于整個(gè)人類基因組煞抬。較大的參考基因組

##5.2 mem

  • bwa mem [-aCHMpP] [-t nThreads] [-k minSeedLen] [-w bandWidth] [-d zDropoff] [-r seedSplitRatio] [-c maxOcc] [-A matchScore] [-B mmPenalty] [-O gapOpenPen] [-E gapExtPen] [-L clipPen] [-U unpairPen] [-R RGline] [-v verboseLevel] db.prefix reads.fq [mates.fq]

使用BWA-MEM算法比對(duì)70bp-1Mbp的測(cè)序數(shù)據(jù);簡(jiǎn)而言之构哺,此算法是基于最大精確匹配(maximal exact matches革答,MEMs)的種子比對(duì)(seeding alignments),然后通過(guò)the affine-gap Smith-Waterman algorithm (SW)延伸種子長(zhǎng)度曙强。

當(dāng)沒(méi)有mates.fq文件以及 沒(méi)有指定參數(shù)-p時(shí)残拐,測(cè)序會(huì)被認(rèn)為是單端測(cè)序。

  • mates.fq存在碟嘴,reads.fqmates.fq中數(shù)據(jù)組成pair read溪食。
  • 使用-preads.fq文件中的2i和(2i+1)read 組成一對(duì)read.

在paired-end模式下娜扇,mem命令將從部分reads推斷read 方向和插入大小分布错沃。

BWA-MEM 進(jìn)行l(wèi)ocal alignment,因此對(duì)于一個(gè)read可能產(chǎn)生多個(gè)匹配結(jié)果雀瓢。

參數(shù)

-t INT: 線程數(shù); 默認(rèn)1

-k INT:種子最短長(zhǎng)度枢析,小于種子長(zhǎng)度的比對(duì)結(jié)果會(huì)被丟棄。默認(rèn)19

-w INT: gap 最大長(zhǎng)度刃麸。值得一提醒叁,最大gap長(zhǎng)度也受評(píng)分矩陣和命中長(zhǎng)度的影響,而不僅僅由這個(gè)參數(shù)決定泊业。默認(rèn)100

-d INT:Off-diagonal X-dropoff (Z-dropoff). [100]

-r FLOAT:a MEM longer than minSeedLen*FLOAT時(shí)把沼,重新生成種子。這是調(diào)優(yōu)性能的一個(gè)探索式參數(shù)吁伺。值越大饮睬,得到的種子越少,校準(zhǔn)速度越快箱蝠,精度越低续捂。[1.5]

-c INT:當(dāng)一個(gè)MEM在參考基因組多余INT次結(jié)果垦垂,丟棄。[10000]

-P: In the paired-end mode, perform SW to rescue missing hits only but do not try to find hits that fit a proper pair.

-A INT:匹配分?jǐn)?shù)[1]

-B INT:Mismatch 罰分牙瓢;序列錯(cuò)誤率約為{.75 * exp[-log(4) * B/A]}[4]

-O INT:gap罰分劫拗。[6]

-E INT:Gap延伸罰分,A gap of length k costs O + k*E (i.e. -O is for opening a zero-length gap). [1]

-L INT:read剪切罰分矾克;進(jìn)行SW 延伸時(shí)页慷。堿基剪切后,匹配分?jǐn)?shù)減去罰分結(jié)果更差胁附,則不進(jìn)行[5]

-U INT:Penalty for an unpaired read pair酒繁。[9]

-p:假定第一個(gè)文件就包含雙端測(cè)序的所有數(shù)據(jù),查看另一個(gè)參數(shù)-P控妻。

-R STR:Complete read group header line. ’@RG\tID:foo\tSM:bar’. [null]

-T INT:匹配打分的最低閾值州袒。 [30]

-a:輸出比對(duì)的所有結(jié)果。

-C:Append append FASTA/Q comment to SAM output.

-H:Use hard clipping ’H’ in the SAM output.

-M:Mark shorter split hits as secondary (for Picard compatibility).
-v INT:Control the verbose level of the output.

##5.3 aln

  • bwa aln [-n maxDiff] [-o maxGapO] [-e maxGapE] [-d nDelTail] [-i nIndelEnd] [-k maxSeedDiff] [-l seedLen] [-t nThrds] [-cRN] [-M misMsc] [-O gapOsc] [-E gapEsc] [-q trimQual] <in.db.fasta> <in.query.fq> > <out.sai>

使用aln查找read SA坐標(biāo)弓候;在第一個(gè)種子區(qū)域的最大差異設(shè)置maxSeedDiff 郎哭,整個(gè)read序列匹配的差異閾值設(shè)置:maxDiff

-n NUM:最大的編輯距離,或者百分比菇存;[0.04]

-o INT:gap最大數(shù)量夸研;[1]

-e INT:Maximum number of gap extensions, -1 for k-difference mode (disallowing long gaps) [-1]

-d INT:3端最大刪除堿基數(shù)目;[16]

-i INT:末端插入最大堿基數(shù)目依鸥;[5]

-l INT:Take the first INT subsequence as seed. If INT is larger than the query sequence, seeding will be disabled. For long reads, this option is typically ranged from 25 to 35 for ‘-k 2’. [inf]

-k INT:種子最大編輯距離亥至;[2]

-t INT:線程數(shù)

-M INT: 錯(cuò)誤罰分[3]

-O INT: Gap open penalty [11]
-E INT: Gap extension penalty [4]

-R INT: Proceed with suboptimal alignments if there are no more than INT equally best hits. This option only affects paired-end mapping. Increasing this threshold helps to improve the pairing accuracy at the cost of speed, especially for short reads (~32bp).

-c: Reverse query but not complement it, which is required for alignment in the color space. (Disabled since 0.6.x)

-N: 禁用迭代搜索。所有沒(méi)有超過(guò)maxDiff 的匹配都被找到

-q INT: Parameter for read trimming. BWA trims a read down to argmax_x{\sum_{i=x+1}^l(INT-q_i)} if q_l<INT where l is the original read length. [0]

-I:The input is in the Illumina 1.3+ read format (quality equals ASCII-64).

-B INT:Length of barcode starting from the 5’-end. When INT is positive, the barcode of each read will be trimmed before mapping and will be written at the BC SAM tag. For paired-end reads, the barcode from both ends are concatenated. [0]
-b: Specify the input read sequence file is the BAM format. For paired-end data, two ends in a pair must be grouped together and options -1 or -2 are usually applied to specify which end should be mapped. Typical command lines for mapping pair-end data in the BAM format are:

bwa aln ref.fa -b1 reads.bam > 1.sai
bwa aln ref.fa -b2 reads.bam > 2.sai
bwa sampe ref.fa 1.sai 2.sai reads.bam reads.bam > aln.sam

-0: When -b is specified, only use single-end reads in mapping.
-1: When -b is specified, only use the first read in a read pair in mapping (skip single-end reads and the second reads).
-2:When -b is specified, only use the second read in a read pair in mapping.

##5.4 samse

  • bwa samse [-n maxOcc] <in.db.fasta> <in.sai> <in.fq> > <out.sam>

  • 單端測(cè)序贱迟,產(chǎn)生SAM文件姐扮,重復(fù)序列隨機(jī)選擇。

-n INT:Maximum number of alignments to output in the XA tag for reads paired properly. If a read has more than INT hits, the XA tag will not be written. [3]

-r STR:Specify the read group in a format like ‘@RG\tID:foo\tSM:bar’. [null]

##5.5 sampe雙端測(cè)序

  • bwa sampe [-a maxInsSize] [-o maxOcc] [-n maxHitPaired] [-N maxHitDis] [-P] <in.db.fasta> <in1.sai> <in2.sai> <in1.fq> <in2.fq> > <out.sam>
  • 雙端測(cè)序关筒,產(chǎn)生SAM文件溶握,重復(fù)序列隨機(jī)選擇杯缺。

-a INT:最大插入片段大小. [500]
-o INT:Maximum occurrences of a read for pairing. A read with more occurrneces will be treated as a single-end read. Reducing this parameter helps faster pairing. [100000]
-P: Load the entire FM-index into memory to reduce disk operations (base-space reads only). With this option, at least 1.25N bytes of memory are required, where N is the length of the genome.
-n INT: Maximum number of alignments to output in the XA tag for reads paired properly. If a read has more than INT hits, the XA tag will not be written. [3]
-N INT:Maximum number of alignments to output in the XA tag for disconcordant read pairs (excluding singletons). If a read has more than INT hits, the XA tag will not be written. [10]
-r STR:指定read group ‘@RG\tID:foo\tSM:bar’. [null]

##5.6 bwasw

bwa bwasw [-a matchScore] [-b mmPen] [-q gapOpenPen] [-r gapExtPen] [-t nThreads] [-w bandWidth] [-T thres] [-s hspIntv] [-z zBest] [-N nHspRev] [-c thresCoef] <in.db.fasta> <in.fq> [mate.fq]

Align query sequences in the in.fq file. When mate.fq is present, perform paired-end alignment. The paired-end mode only works for reads Illumina short-insert libraries. In the paired-end mode, BWA-SW may still output split alignments but they are all marked as not properly paired; the mate positions will not be written if the mate has multiple local hits.

OPTIONS:
-a INT 比對(duì)分?jǐn)?shù) [1]
-b INT 錯(cuò)配罰分 [3]
-q INT Gap 罰分 [5]
-r INT Gap 延伸罰分. The penalty for a contiguous gap of size k is q+kr. [2]
-t INT 線程數(shù) [1]
-w INT Band width in the banded alignment [33]
-T INT Minimum score threshold divided by a [37]
-c FLOAT Coefficient for threshold adjustment according to query length. Given an l-long query, the threshold for a hit to be retained is a
max{T,c*log(l)}. [5.5]
-z INT Z-best heuristics. Higher -z increases accuracy at the cost of speed. [1]
-s INT Maximum SA interval size for initiating a seed. Higher -s increases accuracy at the cost of speed. [3]
-N INT Minimum number of seeds supporting the resultant alignment to skip reverse alignment. [5]

##5.7 SAM比對(duì)格式

aln的結(jié)果是二進(jìn)制蒸播,只適用于BWA 。BWA 將其轉(zhuǎn)換為SAM (Sequence Alignment/Map) 格式:

Col Field Description
1 QNAME Query (pair) NAME
2 FLAG bitwise FLAG
3 RNAME Reference sequence NAME
4 POS 1-based leftmost POSition/coordinate of clipped sequence
5 MAPQ MAPping Quality (Phred-scaled)
6 CIAGR extended CIGAR string
7 MRNM Mate Reference sequence NaMe (‘=’ if same as RNAME)
8 MPOS 1-based Mate POSistion
9 ISIZE Inferred insert SIZE
10 SEQ query SEQuence on the same strand as the reference
11 QUAL query QUALity (ASCII-33 gives the Phred base quality)
12 OPT variable OPTional fields in the format TAG:VTYPE:VALUE

FLAG 內(nèi)容:

Chr Flag Description
p 0x0001 the read is paired in sequencing
P 0x0002 the read is mapped in a proper pair
u 0x0004 the query sequence itself is unmapped
U 0x0008 the mate is unmapped
r 0x0010 strand of the query (1 for reverse)
R 0x0020 strand of the mate
1 0x0040 the read is the first read in a pair
2 0x0080 the read is the second read in a pair
s 0x0100 the alignment is not primary
f 0x0200 QC failure
d 0x0400 optical or PCR duplicate

BWA 會(huì)產(chǎn)生以下內(nèi)容萍肆;X開始是BWA特有的內(nèi)容袍榆。

Tag Meaning
NM Edit distance
MD Mismatching positions/bases
AS Alignment score
BC Barcode sequence
X0 Number of best hits
X1 Number of suboptimal hits found by BWA
XN Number of ambiguous bases in the referenece
XM Number of mismatches in the alignment
XO Number of gap opens
XG Number of gap extentions
XT Type: Unique/Repeat/N/Mate-sw
XA Alternative hits; format: (chr,pos,CIGAR,NM;)*
XS Suboptimal alignment score
XF Support from forward/reverse alignment
XE Number of supporting seeds

Note that XO and XG are generated by BWT search while the CIGAR string by Smith-Waterman alignment. These two tags may be inconsistent with the CIGAR string. This is not a bug.

##5.8 比對(duì)精度

禁用種子比對(duì)時(shí),BWA 會(huì)找到匹配(最大差異(maxDiff 設(shè)置)塘揣,最大gap數(shù)量(maxGapO)包雀, read任何一端都剪切超過(guò)n bp (nIndelEnd ));如果maxGapE 是正數(shù)u亲铡,Longer gaps也會(huì)被尋找到才写,但是不保證找到所有的匹配葡兑。如果使用種子匹配,

BWA 報(bào)賬第一個(gè)種子長(zhǎng)度的序列與reference 的距離不大于maxSeedDiff 赞草。

當(dāng)不使用gapped alignment時(shí)讹堤,BWA 將‘N’改為隨機(jī)核苷酸,并且與之的匹配也會(huì)被保留厨疙。

#6. 原文

Manual Reference Pages -* bwa
Elementolab/BWA tutorial

最后編輯于
?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請(qǐng)聯(lián)系作者
  • 序言:七十年代末洲守,一起剝皮案震驚了整個(gè)濱河市,隨后出現(xiàn)的幾起案子沾凄,更是在濱河造成了極大的恐慌梗醇,老刑警劉巖,帶你破解...
    沈念sama閱讀 206,126評(píng)論 6 481
  • 序言:濱河連續(xù)發(fā)生了三起死亡事件撒蟀,死亡現(xiàn)場(chǎng)離奇詭異叙谨,居然都是意外死亡,警方通過(guò)查閱死者的電腦和手機(jī)保屯,發(fā)現(xiàn)死者居然都...
    沈念sama閱讀 88,254評(píng)論 2 382
  • 文/潘曉璐 我一進(jìn)店門唉俗,熙熙樓的掌柜王于貴愁眉苦臉地迎上來(lái),“玉大人配椭,你說(shuō)我怎么就攤上這事虫溜。” “怎么了股缸?”我有些...
    開封第一講書人閱讀 152,445評(píng)論 0 341
  • 文/不壞的土叔 我叫張陵衡楞,是天一觀的道長(zhǎng)。 經(jīng)常有香客問(wèn)我敦姻,道長(zhǎng)瘾境,這世上最難降的妖魔是什么? 我笑而不...
    開封第一講書人閱讀 55,185評(píng)論 1 278
  • 正文 為了忘掉前任镰惦,我火速辦了婚禮迷守,結(jié)果婚禮上,老公的妹妹穿的比我還像新娘旺入。我一直安慰自己兑凿,他們只是感情好,可當(dāng)我...
    茶點(diǎn)故事閱讀 64,178評(píng)論 5 371
  • 文/花漫 我一把揭開白布茵瘾。 她就那樣靜靜地躺著礼华,像睡著了一般。 火紅的嫁衣襯著肌膚如雪拗秘。 梳的紋絲不亂的頭發(fā)上圣絮,一...
    開封第一講書人閱讀 48,970評(píng)論 1 284
  • 那天,我揣著相機(jī)與錄音雕旨,去河邊找鬼扮匠。 笑死捧请,一個(gè)胖子當(dāng)著我的面吹牛,可吹牛的內(nèi)容都是我干的棒搜。 我是一名探鬼主播血久,決...
    沈念sama閱讀 38,276評(píng)論 3 399
  • 文/蒼蘭香墨 我猛地睜開眼,長(zhǎng)吁一口氣:“原來(lái)是場(chǎng)噩夢(mèng)啊……” “哼帮非!你這毒婦竟也來(lái)了氧吐?” 一聲冷哼從身側(cè)響起,我...
    開封第一講書人閱讀 36,927評(píng)論 0 259
  • 序言:老撾萬(wàn)榮一對(duì)情侶失蹤末盔,失蹤者是張志新(化名)和其女友劉穎筑舅,沒(méi)想到半個(gè)月后,有當(dāng)?shù)厝嗽跇淞掷锇l(fā)現(xiàn)了一具尸體陨舱,經(jīng)...
    沈念sama閱讀 43,400評(píng)論 1 300
  • 正文 獨(dú)居荒郊野嶺守林人離奇死亡翠拣,尸身上長(zhǎng)有42處帶血的膿包…… 初始之章·張勛 以下內(nèi)容為張勛視角 年9月15日...
    茶點(diǎn)故事閱讀 35,883評(píng)論 2 323
  • 正文 我和宋清朗相戀三年,在試婚紗的時(shí)候發(fā)現(xiàn)自己被綠了游盲。 大學(xué)時(shí)的朋友給我發(fā)了我未婚夫和他白月光在一起吃飯的照片误墓。...
    茶點(diǎn)故事閱讀 37,997評(píng)論 1 333
  • 序言:一個(gè)原本活蹦亂跳的男人離奇死亡,死狀恐怖益缎,靈堂內(nèi)的尸體忽然破棺而出谜慌,到底是詐尸還是另有隱情,我是刑警寧澤莺奔,帶...
    沈念sama閱讀 33,646評(píng)論 4 322
  • 正文 年R本政府宣布欣范,位于F島的核電站,受9級(jí)特大地震影響令哟,放射性物質(zhì)發(fā)生泄漏恼琼。R本人自食惡果不足惜,卻給世界環(huán)境...
    茶點(diǎn)故事閱讀 39,213評(píng)論 3 307
  • 文/蒙蒙 一屏富、第九天 我趴在偏房一處隱蔽的房頂上張望晴竞。 院中可真熱鬧,春花似錦狠半、人聲如沸噩死。這莊子的主人今日做“春日...
    開封第一講書人閱讀 30,204評(píng)論 0 19
  • 文/蒼蘭香墨 我抬頭看了看天上的太陽(yáng)甜滨。三九已至乐严,卻和暖如春瘤袖,著一層夾襖步出監(jiān)牢的瞬間,已是汗流浹背昂验。 一陣腳步聲響...
    開封第一講書人閱讀 31,423評(píng)論 1 260
  • 我被黑心中介騙來(lái)泰國(guó)打工捂敌, 沒(méi)想到剛下飛機(jī)就差點(diǎn)兒被人妖公主榨干…… 1. 我叫王不留艾扮,地道東北人。 一個(gè)月前我還...
    沈念sama閱讀 45,423評(píng)論 2 352
  • 正文 我出身青樓占婉,卻偏偏與公主長(zhǎng)得像泡嘴,于是被迫代替她去往敵國(guó)和親。 傳聞我的和親對(duì)象是個(gè)殘疾皇子逆济,可洞房花燭夜當(dāng)晚...
    茶點(diǎn)故事閱讀 42,722評(píng)論 2 345

推薦閱讀更多精彩內(nèi)容