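There are many alignment tools, so start by surveying them yourself. Because this series is meant to get everyone started, please all use hisat2 and make sure you understand how it works.
Download the index files directly from the hisat2 home page, then align the FASTQ reads against the index to obtain SAM files.
Next, use samtools to convert them to BAM, sort them (note the difference between the N and P sort modes, that is, by read name versus by position), index them, load them into IGV, and take screenshots of a few genes.
While you are at it, run a simple QC on the BAM files; see the "直播我的基因组" ("Live-streaming my genome") series.
Installing HISAT2:
Download the Linux build of HISAT2 and unzip it; it is then ready to use: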
$ wget ftp://ftp.ccb.jhu.edu/pub/infphilo/hisat2/downloads/hisat2-2.1.0-Linux_x86_64.zip
Unzip (-d extracts into the given directory):
$ unzip -d /work/LXJ/software/ hisat2-2.1.0-Linux_x86_64.zip
Check that it runs:
$ ./hisat2
(ERR): hisat2-align exited with value 1: this can be ignored, since hisat2 was started here without any arguments
Setting the environment path:
$ sudo vi /etc/environment
Append this directory to the PATH entry: /work/LXJ/software/hisat2-2.1.0
$ source /etc/environment
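/etc/environment is a system-wide list of KEY=value pairs rather than a shell script, so a per-user alternative is to extend PATH from a shell startup file such as ~/.bashrc. The following is a minimal sketch, assuming bash and the install directory used above; the variable name HISAT2_HOME is only illustrative.

```shell
# Assumed install location (taken from the unzip step above).
HISAT2_HOME=/work/LXJ/software/hisat2-2.1.0

# Prepend the directory only if it is not already on PATH,
# so that sourcing this snippet twice is harmless.
case ":$PATH:" in
  *":$HISAT2_HOME:"*) ;;
  *) PATH="$HISAT2_HOME:$PATH" ;;
esac
export PATH

# Report whether the hisat2 wrapper can now be found.
if command -v hisat2 >/dev/null 2>&1; then
  echo "hisat2 found"
else
  echo "hisat2 not found on PATH"
fi
```

Placed in ~/.bashrc, this takes effect in every new terminal without needing sudo.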
Using HISAT2
Genome index
Building a genome index yourself:
Command Line : hisat2-build [options]* <reference_in> <ht2_base>
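Usage : hisat2-build -p 8 genome.fa genome
If you want to analyze SNP, exon, or splice-site information, the annotated SNPs, exons, and splice sites have to be supplied when HISAT2 builds the genome index (the hisat2 package ships helper scripts for this, such as hisat2_extract_splice_sites.py and hisat2_extract_exons.py).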
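As a rough illustration of the annotation-aware build, the sketch below only prints the commands instead of running them, so it can be inspected first. The file names genes.gtf, genome.fa, and genome_tran are placeholders, and the function name print_index_plan is hypothetical; the --ss and --exon options are the ones hisat2-build uses to take splice-site and exon files.

```shell
# print_index_plan GTF FASTA BASE
# Print (do not run) the commands for building an index that knows about
# annotated splice sites and exons.
print_index_plan() {
  gtf=$1
  fasta=$2
  base=$3
  cat <<EOF
hisat2_extract_splice_sites.py $gtf > $base.ss
hisat2_extract_exons.py $gtf > $base.exon
hisat2-build -p 8 --ss $base.ss --exon $base.exon $fasta $base
EOF
}

# Placeholder inputs; replace with your own annotation and genome files.
print_index_plan genes.gtf genome.fa genome_tran
```

Once the printed commands look right, the output can be piped to `sh` to run them.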
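Downloading a genome index:
Download a prebuilt genome index from the HISAT2 website; this saves effort and also avoids mistakes:
This is the mouse genome index; download whichever version you need:
$ wget ftp://ftp.ccb.jhu.edu/pub/infphilo/hisat2/data/mm10.tar.gz
$ tar zxvf mm10.tar.gz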
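hisat2 is given the index as a filename prefix, not as a single file, so a quick check after unpacking can help. The sketch below assumes a regular (small) index made up of eight files named PREFIX.1.ht2 through PREFIX.8.ht2; large indexes use a different suffix and are not covered. The function name check_ht2_index is hypothetical.

```shell
# check_ht2_index PREFIX
# Succeed only if PREFIX.1.ht2 ... PREFIX.8.ht2 all exist and are non-empty.
check_ht2_index() {
  prefix=$1
  missing=0
  for n in 1 2 3 4 5 6 7 8; do
    if [ ! -s "${prefix}.${n}.ht2" ]; then
      echo "missing: ${prefix}.${n}.ht2" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

For example, `check_ht2_index /work/LXJ/Genome/M.musculus/mm10.hisat2.index/genome && echo "index looks complete"` would test the prefix used in the alignment loop further down.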
Aligning RNA-Seq reads to the genome with HISAT2:
hisat2 [options]* -x <ht2-idx> {-1 <m1> -2 <m2> | -U <r> | --sra-acc <SRA accession number>} [-S <sam>]
<ht2-idx> Index filename prefix (minus trailing .X.ht2).
<m1> Files with #1 mates, paired with files in <m2>.
Could be gzip'ed (extension: .gz) or bzip2'ed (extension: .bz2).
<m2> Files with #2 mates, paired with files in <m1>.
Could be gzip'ed (extension: .gz) or bzip2'ed (extension: .bz2).
<r> Files with unpaired reads.
Could be gzip'ed (extension: .gz) or bzip2'ed (extension: .bz2).
<SRA accession number> Comma-separated list of SRA accession numbers, e.g. --sra-acc SRR353653,SRR353654.
<sam> File for SAM output (default: stdout)
<m1>, <m2>, <r> can be comma-separated lists (no whitespace) and can be
specified many times. E.g. '-U file1.fq,file2.fq -U file3.fq'.
Running the HISAT2 alignment:
for i in {59..62};
do
echo $i
hisat2 -t -p 8 -x /work/LXJ/Genome/M.musculus/mm10.hisat2.index/genome -1 SRR35899${i}.sra_1.fastq.gz -2 SRR35899${i}.sra_2.fastq.gz -S /mnt/hgfs/Labubuntu_data/GSE81916.RNAseq/hisat2.mm10/SRR35899${i}.sam;
done
59
Time loading forward index: 00:00:25
Time loading reference: 00:00:04
Multiseed full-index search: 00:15:41
30468155 reads; of these:
30468155 (100.00%) were paired; of these:
2722598 (8.94%) aligned concordantly 0 times
24300848 (79.76%) aligned concordantly exactly 1 time
3444709 (11.31%) aligned concordantly >1 times
----
2722598 pairs aligned concordantly 0 times; of these:
156872 (5.76%) aligned discordantly 1 time
----
2565726 pairs aligned 0 times concordantly or discordantly; of these:
5131452 mates make up the pairs; of these:
3276583 (63.85%) aligned 0 times
1334447 (26.01%) aligned exactly 1 time
520422 (10.14%) aligned >1 times
94.62% overall alignment rate
Time searching: 00:15:45
Overall time: 00:16:11
60
Time loading forward index: 00:00:29
Time loading reference: 00:00:04
Multiseed full-index search: 00:29:01
52972617 reads; of these:
52972617 (100.00%) were paired; of these:
4438954 (8.38%) aligned concordantly 0 times
42836426 (80.87%) aligned concordantly exactly 1 time
5697237 (10.76%) aligned concordantly >1 times
----
4438954 pairs aligned concordantly 0 times; of these:
268939 (6.06%) aligned discordantly 1 time
----
4170015 pairs aligned 0 times concordantly or discordantly; of these:
8340030 mates make up the pairs; of these:
5335211 (63.97%) aligned 0 times
2173091 (26.06%) aligned exactly 1 time
831728 (9.97%) aligned >1 times
94.96% overall alignment rate
Time searching: 00:29:05
Overall time: 00:29:34
61
Time loading forward index: 00:00:31
Time loading reference: 00:00:05
Multiseed full-index search: 00:21:39
36763726 reads; of these:
36763726 (100.00%) were paired; of these:
3102153 (8.44%) aligned concordantly 0 times
29382458 (79.92%) aligned concordantly exactly 1 time
4279115 (11.64%) aligned concordantly >1 times
----
3102153 pairs aligned concordantly 0 times; of these:
173349 (5.59%) aligned discordantly 1 time
----
2928804 pairs aligned 0 times concordantly or discordantly; of these:
5857608 mates make up the pairs; of these:
3596954 (61.41%) aligned 0 times
1595531 (27.24%) aligned exactly 1 time
665123 (11.35%) aligned >1 times
95.11% overall alignment rate
Time searching: 00:21:44
Overall time: 00:22:15
62
Time loading forward index: 00:00:28
Time loading reference: 00:00:05
Multiseed full-index search: 00:22:33
43802631 reads; of these:
43802631 (100.00%) were paired; of these:
3816434 (8.71%) aligned concordantly 0 times
35462440 (80.96%) aligned concordantly exactly 1 time
4523757 (10.33%) aligned concordantly >1 times
----
3816434 pairs aligned concordantly 0 times; of these:
209180 (5.48%) aligned discordantly 1 time
----
3607254 pairs aligned 0 times concordantly or discordantly; of these:
7214508 mates make up the pairs; of these:
4769954 (66.12%) aligned 0 times
1806461 (25.04%) aligned exactly 1 time
638093 (8.84%) aligned >1 times
94.56% overall alignment rate
Time searching: 00:22:38
Overall time: 00:23:06
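The "overall alignment rate" in each summary can be reproduced from two numbers in the log: the number of read pairs, and the number of mates that "aligned 0 times" in the last block. Each pair contributes two mates, so the rate is 100 * (1 - unaligned_mates / (2 * pairs)). The helper below is a small sketch of that arithmetic; the name overall_rate is hypothetical.

```shell
# overall_rate PAIRS UNALIGNED_MATES
# Percentage of mates that aligned at least once, to two decimals.
overall_rate() {
  awk -v pairs="$1" -v unaligned="$2" \
    'BEGIN { printf "%.2f\n", 100 * (1 - unaligned / (2 * pairs)) }'
}

# Sample 59: 30468155 pairs, 3276583 mates aligned 0 times.
overall_rate 30468155 3276583   # prints 94.62
```

The same calculation with the sample 60 figures (52972617 pairs, 5335211 unaligned mates) gives 94.96, matching the log.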
Samtools
samtools view:
Convert the SAM files to BAM files:
for i in {59..62};
do
echo $i
samtools view -S SRR35899${i}.sam -b > SRR35899${i}.bam;
done
samtools sort:
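Sorting is done on the BAM files, not on the SAM files. Here the alignments are sorted by read name (the default is to sort by position on the chromosome). Read-name order is used to suit the later HTSeq step; with the default chromosome-position order, HTSeq would have more work to do when generating the count file.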
for i in {59..62};
do
echo $i
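# samtools >= 1.3 syntax; older 0.1.x releases took an output prefix as the last argument instead of -o
samtools sort -n -@ 8 -o SRR35899${i}_n.sorted.bam SRR35899${i}.bam;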
done
By default samtools sorts by chromosomal position, whereas -n sorts by read name; -t TAG sorts first by the value of tag TAG and then by chromosomal position or read name.
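IGV needs a position-sorted, indexed BAM, which the name-sorted files do not provide. The sketch below coordinate-sorts and indexes each sample, assuming samtools 1.3 or later. The `_p.sorted.bam` naming and the helper names run and sort_and_index are illustrative, and the loop is left in dry-run mode so it only prints what it would execute.

```shell
# run CMD ARGS...: print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = 1 ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# sort_and_index SAMPLE: position-sort SAMPLE.bam, then build the .bai index.
sort_and_index() {
  sample=$1
  run samtools sort -@ 8 -o "${sample}_p.sorted.bam" "${sample}.bam"
  run samtools index "${sample}_p.sorted.bam"
}

DRY_RUN=1   # set to 0 to actually run samtools
for i in 59 60 61 62; do
  sort_and_index "SRR35899${i}"
done
```

Keeping both the `_n` (name-sorted) and `_p` (position-sorted) files means the counting step and IGV each get the order they expect.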
Viewing in IGV
QC of the alignment results:
Commonly used tools include:
Picard https://broadinstitute.github.io/picard/
RSeQC http://rseqc.sourceforge.net/
Qualimap http://qualimap.bioinfo.cipf.es/
RSeQC is used here. It bundles a wide range of tools, and the RSeQC website provides test data and worked examples.
RSeQC
Installation: pip install RSeQC
Available programs:
- bam2fq.py
- bam2wig.py
- bam_stat.py
- clipping_profile.py
- deletion_profile.py
- divide_bam.py
- FPKM_count.py
- geneBody_coverage.py
- geneBody_coverage2.py
- infer_experiment.py
- inner_distance.py
- insertion_profile.py
- junction_annotation.py
- junction_saturation.py
- mismatch_profile.py
- normalize_bigwig.py
- overlay_bigwig.py
- read_distribution.py
- read_duplication.py
- read_GC.py
- read_hexamer.py
- read_NVC.py
- read_quality.py
- RNA_fragment_size.py
- RPKM_count.py
- RPKM_saturation.py
- split_bam.py
- split_paired_bam.py
- tin.py
bam_stat.py summarizes the mapping statistics of the reads:
$ bam_stat.py -i SRR3589959.sort.bam