Circexplorer2
參考文檔:
這個(gè)主要記錄一下使用Circexplorer2的流程
目前已經(jīng)通過(guò)之前的步驟獲得了trimmed的序列,并且qc過(guò)了霞扬,然后bowtie2糙捺,bwa芹壕,hisat栏笆,等等的索引都已經(jīng)做好。
下面參考文檔CIRCexplorer2手冊(cè)没咙,里面提供了單端和雙端的align蔬螟。這兒我使用的是GSE99531
這個(gè)是一個(gè)雙端測(cè)序的rna文庫(kù),參數(shù)如下:
Instrument: Illumina HiSeq 4000
Strategy: RNA-Seq
Source: TRANSCRIPTOMIC
Selection: cDNA
Layout: PAIRED
但是因?yàn)檫@個(gè)軟件本身默認(rèn)是用tophat處理單端測(cè)序所以就記錄一下使用铁孵。
Align
單端
From index files (bowtie1_index
is the prefix for bowtie1 index files, and bowtie2_index
is the prefix for bowtie2 index files):
CIRCexplorer2 align -G hg19_kg.gtf -i bowtie1_index -j bowtie2_index -f RNA_seq.fastq > CIRCexplorer2_align.log
-i -j 應(yīng)該是拿來(lái)指定索引文件的锭硼,用一個(gè)就行了,gtf應(yīng)該用的是knowgene那個(gè)gtf蜕劝。
雙端測(cè)序
tophat 2.1.0
tophat利用bowtie2建立的索引進(jìn)行比對(duì)檀头。這個(gè)來(lái)自circexplorer2的手冊(cè)。
tophat2 -o tophat_fusion -p 15 --fusion-search --keep-fasta-order --bowtie1 --no-coverage-search index fq1 fq2
-o | --output < string > default: ./tophat_out
輸出的文件夾路徑
-p 15
線(xiàn)程數(shù)岖沛,老生常談
--fusion-search
開(kāi)啟融合轉(zhuǎn)錄子的比對(duì)
--keep-fasta-order
對(duì)比對(duì)結(jié)果按基因組fasta文件進(jìn)行排序暑始。該參數(shù)會(huì)使輸出的SAM/BAM文件和tophat的1.41或以前版本不兼容
--bowtie1 default: bowtie2
使用Bowtie1來(lái)代替Bowtie2進(jìn)行比對(duì)。特別是使用colorspace reads時(shí)婴削,因?yàn)橹挥蠦owtie1支持廊镜,而B(niǎo)owtie2不支持。
--no-coverage-search
取消以覆蓋度為基礎(chǔ)來(lái)搜尋junctions唉俗,和下一個(gè)參數(shù)對(duì)立嗤朴,該參數(shù)為默認(rèn)參數(shù)。
我的命令如下:
tophat2 -o tophat_fusion -p 30 --fusion-search --keep-fasta-order --no-coverage-search ~/Circ/index/bowtie2/hg19/hg19 ~/Circ/fqc/SRR5635537_1_val_1.fq.gz ~/Circ/fqc/SRR5635537_2_val_2.fq.gz
tophat2 -o tophat_fusion -p 30 --fusion-search --keep-fasta-order --no-coverage-search ~/Circ/index/bowtie2/hg19/hg19 ~/Circ/fqc/SRR5635538_1_val_1.fq.gz ~/Circ/fqc/SRR5635538_2_val_2.fq.gz
腳本寫(xiě)法:
ls *.gz |cut -d "_" -f 1 | sort -u |while read id;do
ls lh ${id}_1_val_1.fq.gz ${id}_2_val_2.fq.gz
tophat2 -o tophat_fusion -p 30 --fusion-search --keep-fasta-order --no-coverage-search ~/Circ/index/bowtie2/hg19/hg19 ${id}_1_val_1.fq.gz ${id}_2_val_2.fq.gz
done
任務(wù)丟后臺(tái)留一些參考吧
linux后臺(tái)運(yùn)行&符號(hào)虫溜、nohup命令雹姊、輸出重定向等使用方法
nohup conmmand 1>nohup37.out 2>error37 &
輸出:
[2020-07-07 18:28:18] Beginning TopHat run (v2.1.0)
-----------------------------------------------
[2020-07-07 18:28:18] Checking for Bowtie
Bowtie version: 2.3.5.1
[2020-07-07 18:28:18] Checking for Bowtie index files (genome)..
[2020-07-07 18:28:18] Checking for reference FASTA file
[2020-07-07 18:28:18] Generating SAM header for /home/dick/Circ/index/bowtie2/hg19/hg19
[2020-07-07 18:28:20] Preparing reads
left reads: min. length=25, max. length=51, 65118503 kept reads (29794 discarded)
right reads: min. length=25, max. length=51, 65111529 kept reads (36768 discarded)
[2020-07-07 18:48:44] Mapping left_kept_reads to genome hg19 with Bowtie2
[2020-07-07 19:23:03] Mapping left_kept_reads_seg1 to genome hg19 with Bowtie2 (1/2)
[2020-07-07 19:27:28] Mapping left_kept_reads_seg2 to genome hg19 with Bowtie2 (2/2)
[2020-07-07 19:34:15] Mapping right_kept_reads to genome hg19 with Bowtie2
[2020-07-07 20:07:28] Mapping right_kept_reads_seg1 to genome hg19 with Bowtie2 (1/2)
[2020-07-07 20:12:43] Mapping right_kept_reads_seg2 to genome hg19 with Bowtie2 (2/2)
[2020-07-07 20:18:47] Searching for junctions via segment mapping
[2020-07-07 21:20:20] Retrieving sequences for splices
[2020-07-07 21:27:10] Indexing splices
Building a LARGE index
[2020-07-08 00:11:23] Mapping left_kept_reads_seg1 to genome segment_juncs with Bowtie2 (1/2)
[2020-07-08 00:39:47] Mapping left_kept_reads_seg2 to genome segment_juncs with Bowtie2 (2/2)
[2020-07-08 01:01:28] Joining segment hits
[2020-07-08 01:25:55] Mapping right_kept_reads_seg1 to genome segment_juncs with Bowtie2 (1/2)
[2020-07-08 01:54:26] Mapping right_kept_reads_seg2 to genome segment_juncs with Bowtie2 (2/2)
[2020-07-08 02:18:06] Joining segment hits
[2020-07-08 02:43:22] Reporting output tracks
-----------------------------------------------
[2020-07-08 03:45:32] A summary of the alignment counts can be found in tophat_fusion/37/align_summary.txt
[2020-07-08 03:45:32] Run complete: 09:17:14 elapsed
###########################################################################################
-rw-rw-r-- 1 dick dick 7.0G 7月 8 03:44 accepted_hits.bam
-rw-rw-r-- 1 dick dick 569 7月 8 03:12 align_summary.txt
-rw-rw-r-- 1 dick dick 5.7M 7月 8 03:12 deletions.bed
-rw-rw-r-- 1 dick dick 42M 7月 8 03:12 fusions.out
-rw-rw-r-- 1 dick dick 3.0M 7月 8 03:12 insertions.bed
-rw-rw-r-- 1 dick dick 12M 7月 8 03:12 junctions.bed
drwxrwxr-x 2 dick dick 4.0K 7月 8 03:44 logs
-rw-rw-r-- 1 dick dick 184 7月 7 18:48 prep_reads.info
-rw-rw-r-- 1 dick dick 243M 7月 8 03:45 unmapped.bam
報(bào)錯(cuò)踩坑記錄:
[2020-07-07 09:45:52] Beginning TopHat run (v2.1.1)
-----------------------------------------------
[2020-07-07 09:45:52] Checking for Bowtie
Bowtie version: 2.3.5.1
[2020-07-07 09:45:52] Checking for Bowtie index files (genome)..
[2020-07-07 09:45:52] Checking for reference FASTA file
[2020-07-07 09:45:52] Generating SAM header for /home/dick/Circ/index/bowtie2/hg19/hg19
[2020-07-07 09:45:54] Preparing reads
left reads: min. length=25, max. length=51, 65118503 kept reads (29794 discarded)
right reads: min. length=25, max. length=51, 65111529 kept reads (36768 discarded)
[2020-07-07 10:06:11] Mapping left_kept_reads to genome hg19 with Bowtie2
[2020-07-07 10:52:28] Mapping left_kept_reads_seg1 to genome hg19 with Bowtie2 (1/2)
[2020-07-07 10:57:41] Mapping left_kept_reads_seg2 to genome hg19 with Bowtie2 (2/2)
[2020-07-07 11:05:37] Mapping right_kept_reads to genome hg19 with Bowtie2
[2020-07-07 11:46:10] Mapping right_kept_reads_seg1 to genome hg19 with Bowtie2 (1/2)
[2020-07-07 11:50:11] Mapping right_kept_reads_seg2 to genome hg19 with Bowtie2 (2/2)
[2020-07-07 11:55:02] Searching for junctions via segment mapping
[2020-07-07 12:49:01] Retrieving sequences for splices
[2020-07-07 12:55:46] Indexing splices
Building a LARGE index
[2020-07-07 15:24:33] Mapping left_kept_reads_seg1 to genome segment_juncs with Bowtie2 (1/2)
[2020-07-07 15:46:39] Mapping left_kept_reads_seg2 to genome segment_juncs with Bowtie2 (2/2)
[2020-07-07 16:10:08] Joining segment hits
[2020-07-07 16:36:08] Mapping right_kept_reads_seg1 to genome segment_juncs with Bowtie2 (1/2)
[2020-07-07 17:00:17] Mapping right_kept_reads_seg2 to genome segment_juncs with Bowtie2 (2/2)
[2020-07-07 17:22:40] Joining segment hits
[2020-07-07 17:48:31] Reporting output tracks
[FAILED]
Error running /home/dick/anaconda3/envs/rna/bin/tophat_reports --min-anchor 8 --splice-mismatches 0 --min-report-intron 50 --max-report-intron 500000 --min-isoform-fraction 0.15 --output-dir tophat_fusion/37/ --max-multihits 20 --max-seg-multihits 40 --segment-length 25 --segment-mismatches 2 --min-closure-exon 100 --min-closure-intron 50 --max-closure-intron 5000 --min-coverage-intron 50 --max-coverage-intron 20000 --min-segment-intron 50 --max-segment-intron 500000 --read-mismatches 2 --read-gap-length 2 --read-edit-dist 2 --read-realign-edit-dist 3 --max-insertion-length 3 --max-deletion-length 3 --fusion-search --fusion-anchor-length 20 --fusion-min-dist 10000000 --fusion-read-mismatches 2 --fusion-multireads 2 --fusion-multipairs 2 -z gzip -p30 --inner-dist-mean 50 --inner-dist-std-dev 20 --no-closure-search --no-coverage-search --no-microexon-search --sam-header tophat_fusion/37/tmp/hg19_genome.bwt.samheader.sam --report-discordant-pair-alignments --report-mixed-alignments --samtools=/home/dick/anaconda3/envs/rna/bin/samtools_0.1.18 --bowtie2-max-penalty 6 --bowtie2-min-penalty 2 --bowtie2-penalty-for-N 1 --bowtie2-read-gap-open 5 --bowtie2-read-gap-cont 3 --bowtie2-ref-gap-open 5 --bowtie2-ref-gap-cont 3 /home/dick/Circ/index/bowtie2/hg19/hg19.fa tophat_fusion/37/junctions.bed tophat_fusion/37/insertions.bed tophat_fusion/37/deletions.bed tophat_fusion/37/fusions.out tophat_fusion/37/tmp/accepted_hits tophat_fusion/37/tmp/left_kept_reads.mapped.bam,tophat_fusion/37/tmp/left_kept_reads.candidates tophat_fusion/37/tmp/left_kept_reads.bam tophat_fusion/37/tmp/right_kept_reads.mapped.bam,tophat_fusion/37/tmp/right_kept_reads.candidates tophat_fusion/37/tmp/right_kept_reads.bam
./SeqAn-1.4.2/seqan/basic/basic_exception.h:236 FAILED! (Uncaught exception of type St12out_of_range: basic_string::substr)
這個(gè)好像是版本原因?qū)е拢裻ophat換成2.1.0就可以了衡楞。
STAR
STAR --chimSegmentMin 10 \
--runThreadN 30 \
--genomeDir ~/Circ/index/STAR/hg19 \
--readFilesCommand zcat \
--readFilesIn ~/Circ/fqc/SRR5635537_1_val_1.fq ~/Circ/fqc/SRR5635537_2_val_2.fq \
--outFileNamePrefix ~/Circ/STAR_map/37/SRR5635537_
這兒readFileCommand我一開(kāi)始寫(xiě)的gzip后來(lái)發(fā)現(xiàn)不行吱雏,就還是用它給的參數(shù)吧
回頭記錄一下輸出,這個(gè)速度真的快
STAR --chimSegmentMin 10 \
--runThreadN 30 \
--genomeDir ~/Circ/index/STAR/hg38 \
--readFilesCommand zcat \
--readFilesIn ~/Circ/fqc/SRR5635537_1_val_1.fq ~/Circ/fqc/SRR5635537_2_val_2.fq \
--outFileNamePrefix ~/Circ/STAR_map/37/SRR5635537_
parse
CIRCexplorer2 parse -t STAR SRR5635537_Chimeric.out.junction > CIRCexplorer2_parse.log
annotation
CIRCexplorer2 annotate -r hg19_ref_all.txt -g hg19.fa -b back_spliced_junction.bed -o circularRNA_known.txt > CIRCexplorer2_annotate.log
這個(gè) hg19_ref_all.txt
fetch_ucsc.py hg19 ref hg19_ref.txt
fetch_ucsc.py hg19 kg hg19_kg.txt
fetch_ucsc.py hg19 ens hg19_ens.txt
cat hg19_ref.txt hg19_kg.txt hg19_ens.txt > hg19_ref_all.txt
轉(zhuǎn)gtf
cut -f2-11 hg19_ref.txt|/home/dick/Circ/index/ref/genePredToGtf file stdin hg19_ref.gtf
# or
cut -f2-11 hg19_kg.txt|/home/dick/Circ/index/ref/genePredToGtf file stdin hg19_kg.gtf
# or
cut -f2-11 hg19_ens.txt|/home/dick/Circ/index/ref/genePredToGtf file stdin hg19_ens.gtf
cut -f2-11 hg19_ref_all.txt|/home/dick/Circ/index/ref/genePredToGtf file stdin hg19_ref_all.gtf
annotation
Annotating for Circular RNAs
CIRCexplorer2 annotate -r ~/Circ/index/ref/hg19_ref_all.txt -g ~/Circ/index/raw/hg19/hg19.fa -b back_spliced_junction.bed -o circularRNA_known.txt > CIRCexplorer2_annotate.log