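These are my notes on the tools I use for ChIP-seq / CUT&Tag analysis. This post gives only a brief introduction to how each tool is used.
Bowtie 2 is a widely used genome alignment program. I won't go into how it works here; interested readers can consult the official documentation and the original paper (https://doi.org/10.1038/nmeth.1923). Below is a brief introduction to the Bowtie 2 index-building and alignment commands, along with the parameters I commonly use.
Usage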
Index
bowtie2-build [options]* <reference_in> <bt2_base>
<reference_in>: with -f, a comma-separated list of reference FASTA files to index; with -c, the reference sequences themselves, given directly on the command line, e.g. GGTCATCCT,ACGGGTCGT,CCGTTCTATGCGGCTTA.
<bt2_base>: the prefix of the generated index files. By default, bowtie2-build writes NAME.1.bt2, NAME.2.bt2, NAME.3.bt2, NAME.4.bt2, NAME.rev.1.bt2, and NAME.rev.2.bt2, where NAME is <bt2_base>.
--threads: number of threads to use
Example
bowtie2-build -f /public/Reference/GRCh38.primary_assembly.genome.fa --threads 24 GRCh38
The command above takes the FASTA file /public/Reference/GRCh38.primary_assembly.genome.fa and writes index files with the prefix GRCh38 to the current directory.
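After building, it can be worth confirming that all six files exist before starting a long alignment job. The following is a minimal sketch, assuming the standard .bt2 suffixes listed above; the helper names are my own, and it does not handle the different extension that Bowtie 2 uses for "large" indexes of very big genomes.

```python
from pathlib import Path

# Suffixes of a standard Bowtie 2 index, as listed above.
BT2_SUFFIXES = [".1.bt2", ".2.bt2", ".3.bt2", ".4.bt2", ".rev.1.bt2", ".rev.2.bt2"]


def expected_index_files(bt2_base):
    """Return the six file paths bowtie2-build should create for this prefix."""
    return [Path(f"{bt2_base}{suffix}") for suffix in BT2_SUFFIXES]


def missing_index_files(bt2_base):
    """Return the expected index files that do not exist on disk."""
    return [p for p in expected_index_files(bt2_base) if not p.exists()]


if __name__ == "__main__":
    # "GRCh38" is the prefix from the example command, relative to the current directory.
    missing = missing_index_files("GRCh38")
    if missing:
        print("Missing index files:", ", ".join(str(p) for p in missing))
    else:
        print("All six index files are present.")
```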
Alignment
Single-end alignment
bowtie2 [options]* -x <bt2-idx> -U <fq> -S <sam_output> -p <threads> 2>Align.summary
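-x: prefix of the reference genome index files (including the path)
-U: FASTQ file of single-end reads
-S: output SAM file containing the alignments
-p: number of threads to use
"2>Align.summary": redirects standard error, which would otherwise be printed to the screen, to the file "Align.summary". Bowtie 2 writes its alignment summary to standard error, and it typically looks like this: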
## Single-end
20000 reads; of these:
  20000 (100.00%) were unpaired; of these:
    1247 (6.24%) aligned 0 times
    18739 (93.69%) aligned exactly 1 time
    14 (0.07%) aligned >1 times
93.77% overall alignment rate
## Paired-end
10000 reads; of these:
  10000 (100.00%) were paired; of these:
    650 (6.50%) aligned concordantly 0 times
    8823 (88.23%) aligned concordantly exactly 1 time
    527 (5.27%) aligned concordantly >1 times
    ----
    650 pairs aligned concordantly 0 times; of these:
      34 (5.23%) aligned discordantly 1 time
    ----
    616 pairs aligned 0 times concordantly or discordantly; of these:
      1232 mates make up the pairs; of these:
        660 (53.57%) aligned 0 times
        571 (46.35%) aligned exactly 1 time
        1 (0.08%) aligned >1 times
96.70% overall alignment rate
The indentation indicates how subtotals relate to totals.
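When many samples are aligned, it is handy to pull the headline numbers out of each Align.summary automatically. Below is a rough sketch, assuming the summary has the format shown above; the function name and the returned keys are my own choice, not part of Bowtie 2. It also recomputes the paired-end overall rate from the subtotals to show where 96.70% comes from.

```python
import re


def parse_bowtie2_summary(text):
    """Extract the total read (or pair) count and the overall alignment rate."""
    total = re.search(r"^\s*(\d+) reads; of these:", text, re.MULTILINE)
    rate = re.search(r"^\s*([\d.]+)% overall alignment rate", text, re.MULTILINE)
    if total is None or rate is None:
        raise ValueError("text does not look like a Bowtie 2 alignment summary")
    return {"total": int(total.group(1)), "overall_rate": float(rate.group(1))}


# The paired-end example from above.
paired_end = """\
10000 reads; of these:
  10000 (100.00%) were paired; of these:
    650 (6.50%) aligned concordantly 0 times
    8823 (88.23%) aligned concordantly exactly 1 time
    527 (5.27%) aligned concordantly >1 times
    ----
    650 pairs aligned concordantly 0 times; of these:
      34 (5.23%) aligned discordantly 1 time
    ----
    616 pairs aligned 0 times concordantly or discordantly; of these:
      1232 mates make up the pairs; of these:
        660 (53.57%) aligned 0 times
        571 (46.35%) aligned exactly 1 time
        1 (0.08%) aligned >1 times
96.70% overall alignment rate
"""

summary = parse_bowtie2_summary(paired_end)
print(summary)

# Cross-check: every aligned pair contributes two mates, and mates from
# unaligned pairs that aligned on their own are counted individually.
aligned_pairs = 8823 + 527 + 34
aligned_mates = 2 * aligned_pairs + 571 + 1
print(aligned_mates, "of", 2 * summary["total"], "mates aligned")
```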
Paired-end alignment
bowtie2 [options]* -x <bt2-idx> -1 <fq1> -2 <fq2> -S <sam_output> -p <threads> 2>Align.summary
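Paired-end mode is essentially the same as single-end mode; only the arguments that pass in the FASTQ files change.
-1: FASTQ file containing the mate 1 reads
-2: FASTQ file containing the mate 2 reads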
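For scripted pipelines, the two command forms above can be assembled from one helper. This is only a sketch: build_bowtie2_cmd is a hypothetical function of my own, the file names are placeholders, and it uses only the options described above (-x, -U, -1, -2, -S, -p).

```python
import shlex


def build_bowtie2_cmd(index_prefix, sam_out, threads, fq1, fq2=None):
    """Build a bowtie2 argument list; fq2=None means single-end."""
    cmd = ["bowtie2", "-x", index_prefix]
    if fq2 is None:
        cmd += ["-U", fq1]              # single-end reads
    else:
        cmd += ["-1", fq1, "-2", fq2]   # mate 1 and mate 2 files
    cmd += ["-S", sam_out, "-p", str(threads)]
    return cmd


# File names below are hypothetical placeholders.
cmd = build_bowtie2_cmd("GRCh38", "sample.sam", 8, "sample_R1.fq", "sample_R2.fq")
print(shlex.join(cmd))

# To run it and capture the summary, as "2>Align.summary" does in the shell:
# import subprocess
# with open("Align.summary", "w") as log:
#     subprocess.run(cmd, stderr=log, check=True)
```

Passing the command as a list avoids shell quoting problems with unusual file names; the redirection is then done by handing subprocess.run an open file for stderr.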
Bowtie 2 has many more alignment parameters that can be tuned; I won't cover them one by one here. Next, the meaning of each column in the SAM file it outputs.
SAM OUTPUT
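Each line of the SAM file describes the alignment of one read and contains at least 12 tab-separated columns (the 11 mandatory SAM fields plus optional fields). From left to right, the columns are:
- Read name
- Sum of flags
In Bowtie 2, the flags have the following meanings: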
1: The read is one of a pair
2: The alignment is one end of a proper paired-end alignment
4: The read has no reported alignments
8: The read is one of a pair and has no reported alignments
16: The alignment is to the reverse reference strand
32: The other mate in the paired-end alignment is aligned to the reverse reference strand
64: The read is mate 1 in a pair
128: The read is mate 2 in a pair
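Note that the flag bits themselves are defined by the SAM specification, but aligners differ in exactly when they set some of them, so the descriptions above are Bowtie 2's.
- Name of the reference sequence (chromosome) the read aligned to
- 1-based coordinate on the forward reference strand of the leftmost aligned base (for a reverse-strand alignment this is not the read's 5' end)
- Mapping quality
- CIGAR string, describing the alignment
- For paired-end data, the name of the reference sequence the mate aligned to: = if it is the same as this read's, * if there is no mate
- For paired-end data, the 1-based coordinate on the forward reference strand of the leftmost aligned base of the mate; 0 if there is no mate
- Inferred fragment length. The value is negative if the mate aligned upstream of this read, and 0 if the mates did not align concordantly; it is non-zero, however, if the mates aligned discordantly to the same chromosome.
- Read sequence
- ASCII-encoded read base qualities
- Optional fields, including the following: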
AS:i:<N> Alignment score. Can be negative. Can be greater than 0 in --local mode (but not in --end-to-end mode). Only present if SAM record is for an aligned read.
XS:i:<N> Alignment score for the best-scoring alignment found other than the alignment reported. Can be negative. Can be greater than 0 in --local mode (but not in --end-to-end mode). Only present if the SAM record is for an aligned read and more than one alignment was found for the read. Note that, when the read is part of a concordantly-aligned pair, this score could be greater than AS:i.
YS:i:<N> Alignment score for opposite mate in the paired-end alignment. Only present if the SAM record is for a read that aligned as part of a paired-end alignment.
XN:i:<N> The number of ambiguous bases in the reference covering this alignment. Only present if SAM record is for an aligned read.
XM:i:<N> The number of mismatches in the alignment. Only present if SAM record is for an aligned read.
XO:i:<N> The number of gap opens, for both read and reference gaps, in the alignment. Only present if SAM record is for an aligned read.
XG:i:<N> The number of gap extensions, for both read and reference gaps, in the alignment. Only present if SAM record is for an aligned read.
NM:i:<N> The edit distance; that is, the minimal number of one-nucleotide edits (substitutions, insertions and deletions) needed to transform the read string into the reference string. Only present if SAM record is for an aligned read.
YF:Z:<S> String indicating reason why the read was filtered out. See also: Filtering. Only appears for reads that were filtered out.
YT:Z:<S> Value of UU indicates the read was not part of a pair. Value of CP indicates the read was part of a pair and the pair aligned concordantly. Value of DP indicates the read was part of a pair and the pair aligned discordantly. Value of UP indicates the read was part of a pair but the pair failed to align either concordantly or discordantly.
MD:Z:<S> A string representation of the mismatched reference bases in the alignment.
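To make the column layout concrete, here is a small sketch that splits one SAM record into the columns described above, breaks the flag sum into its individual bits, and collects the optional TAG:TYPE:VALUE fields into a dictionary. The record is made up for illustration (it is not real Bowtie 2 output), and the function and key names are my own; for real work a dedicated SAM/BAM library is probably the better choice.

```python
# Flag bits and meanings as listed above for Bowtie 2.
FLAG_MEANINGS = {
    1: "read is one of a pair",
    2: "alignment is one end of a proper paired-end alignment",
    4: "read has no reported alignments",
    8: "read is one of a pair and has no reported alignments",
    16: "alignment is to the reverse reference strand",
    32: "other mate is aligned to the reverse reference strand",
    64: "read is mate 1 in a pair",
    128: "read is mate 2 in a pair",
}


def decode_flag(flag):
    """Return the bit values set in a SAM flag sum, in ascending order."""
    return [bit for bit in FLAG_MEANINGS if flag & bit]


def parse_optional_fields(fields):
    """Turn TAG:TYPE:VALUE strings into a dict; type 'i' values become ints."""
    tags = {}
    for field in fields:
        tag, typ, value = field.split(":", 2)
        tags[tag] = int(value) if typ == "i" else value
    return tags


def parse_sam_line(line):
    """Split one alignment line into the 11 mandatory columns plus tags."""
    cols = line.rstrip("\n").split("\t")
    return {
        "qname": cols[0], "flag": int(cols[1]), "rname": cols[2],
        "pos": int(cols[3]), "mapq": int(cols[4]), "cigar": cols[5],
        "rnext": cols[6], "pnext": int(cols[7]), "tlen": int(cols[8]),
        "seq": cols[9], "qual": cols[10],
        "tags": parse_optional_fields(cols[11:]),
    }


# A made-up record for illustration only.
line = "\t".join([
    "read1", "99", "chr1", "100", "42", "10M", "=", "250", "160",
    "ACGTACGTAC", "IIIIIIIIII", "AS:i:0", "XM:i:0", "NM:i:0", "YT:Z:CP",
])
rec = parse_sam_line(line)
for bit in decode_flag(rec["flag"]):
    print(bit, FLAG_MEANINGS[bit])
print(rec["tags"])
```

Here the flag 99 breaks down into 1 + 2 + 32 + 64: a paired read, in a proper pair, whose mate is on the reverse strand, and which is mate 1.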
That wraps up my notes on genome alignment with Bowtie 2; I'll add more as I learn.
ref:
http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml#how-is-bowtie-2-different-from-bowtie-1
The end.