在RNA-seq中針對circRNA的鑒定與定量的綜合性分析方法——CIRIquant
CIRIquant 與其他同類算法相比荒叶,其在鑒定circRNA的過程中降低了假陽性率。在circRNA的定量和差異表達(dá)分析方面交胚,CIRIquant也考慮到RNase R處理不均等因素對最終結(jié)果的影響中鼠,通過qPCR驗(yàn)證其結(jié)果相較于其他方法擁有更高的準(zhǔn)確度胯究。同時色难,CIRIquant還提供對circRNA與線性轉(zhuǎn)錄本比例的分析泼舱,為circRNA生物發(fā)生和調(diào)控的相關(guān)研究提供基礎(chǔ)。
鑒定原理
- 首先將reads使用hisat2與參考基因組進(jìn)行比對枷莉,使用CIRI2或者其他軟件對circRNA進(jìn)行鑒定獲得潛在的circRNA娇昙;
- 為了對circRNA的表達(dá)水平進(jìn)行精準(zhǔn)的定量并篩選BSJs的假陽性,我們將BSJ區(qū)域的兩個全長序列連接構(gòu)建一個偽circRNA參考序列笤妙;
- 然后將潛在的circRNA重新比對到偽序列上冒掌,確定BSJ reads 是否可以與BSJ區(qū)域完全線性對齊噪裕。
- 結(jié)合與基因組和偽circRNA序列比對結(jié)果,可以通過計(jì)算BSJ上的環(huán)裝剪切junction reads 的比列來確定每個circRNA的junction 率股毫;然后使用RNA-seq分析流程獲得轉(zhuǎn)錄本水平表達(dá)量信息膳音。
1. 安裝
1.1 CIRIquant依賴的軟件與python包:
Softwares:
- bwa
- hisat2
- stringtie
- samtools >= 1.9 (older version of samtools may use deprecated parameters in sort and index commands)
Python packages:
- PyYAML
- argparse
- pysam
- numpy
- scipy
- scikit-learn
1.2 使用源代碼安裝
從git-hub或者SourceForge處獲得最新版本
# create and activate virtual env
pip install virtualenv #OR conda install virtualenv
virtualenv -p /path/to/your/python2/executable venv
source ./venv/bin/activate
# Install CIRIquant and its requirement automatically
tar zxvf CIRIquant.tar.gz
cd CIRIquant
python setup.py install
# Manual installation of required pacakges is also supported
pip install -r requirements.txt #OR conda install --yes --file requirements.txt
1.3
直接使用pip安裝
pip install CIRIquant
2 使用
2.1 circRNA的定量
參數(shù)使用:
Usage:
CIRIquant [options] --config <config> -1 <m1> -2 <m2>
<config> Config file
<m1> Input mate1 reads (for paired-end data)
<m2> Input mate2 reads (for paired-end data)
Options (defaults in parentheses):
-v Run in verbose mode
-o, --out Output directory (default: current directory)
-e, --log Specific log file (default: sample_prefix.log)
-p, --prefix Output sample prefix (default: input sample name)
-t, --threads Number of CPU threads to use (defualt: 4)
-a, --anchor Minimum anchor length for junction alignment (default: 5)
-l, --library-type Library type, 0: unstranded, 1: read1 match the sense strand, 2: read1 match the antisense strand (default: 0)
--bed User provided Back-Spliced Junction Site in BED format
--circ circRNA prediction results from other tools
--tool Tool name, required when --circ is specified ([CIRI2/CIRCexplorer2/DCC/KNIFE/MapSplice/UROBORUS/circRNA_finder/find_circ])
--RNaseR CIRIquant output file of RNase R data (required for RNase R correction)
--bam Specific hisat2 alignment bam file against reference genome
--no-gene Skip StringTie estimation of gene abundance
注意:
- 目前,–circ 和–tool 選項(xiàng)支持來自
CIRI2
/CIRCexplorer2
/DCC
/KNIFE
/MapSplice
/UROBORUS
/circRNA_finder
/find_cir
的結(jié)果 - 對于DCC和circRNA_finder等工具铃诬,需手動刪除連接位置相同但鏈相反的重復(fù)環(huán)狀rna祭陷。
- 基因表達(dá)值需要進(jìn)行歸一化,如果之后需要運(yùn)行DE分析氧急,不要使用-
--no-gene
颗胡。
CIRIquant需要輸入一個配置文件
// Example of config file
name: hg19
tools:
bwa: path/bwa
hisat2: path/hisat2
stringtie: path/stringtie
samtools: path/samtools
reference:
fasta: path/hg19.fa
gtf: path/gencode.v19.annotation.gtf
bwa_index: path//hg19
hisat_index: path/hg19
Key | Description |
---|---|
name | the name of config file |
bwa | the path of bwa |
hisat2 | the path of hisat2 |
stringtie | the path of stringite |
samtools | the path of samtools, samtools version below 1.3.1 is not supported |
fasta | reference genome fasta, a fai index by samtools faidx is also needed under the same directory |
gtf | annotation file of reference genome in GTF/GFF3 format |
bwa_index | prefix of BWA index for reference genome |
hisat_index | prefix of HISAT2 index for reference genome |
對于線下提供的circRNA定量表需要有junction位點(diǎn)的bed文件毫深,第4列的格式必須為“chrom:start|end”
吩坝。例如:
chr1 10000 10099 chr1:10000|10099 . +
chr1 31000 31200 chr1:31000|31200 . -
Usage
推薦:使用CIRI2進(jìn)行circRMA的預(yù)測
CIRIquant -t 4 \
-1 ./test_1.fq.gz \
-2 ./test_2.fq.gz \
--config ./chr1.yml \
-o ./test \
-p test
使用提供的bed文件時:
CIRIquant -t 4 \
-1 ./test_1.fq.gz \
-2 ./test_2.fq.gz \
--config ./chr1.yml \
-o ./test \
-p test \
--bed your_circRNAs.bed
使用其它鑒定工具時:
例如使用find_circ鑒定的circRNA結(jié)果
CIRIquant -t 4 \
-1 ./test_1.fq.gz \
-2 ./test_2.fq.gz \
--config ./chr1.yml \
-o ./test \
-p test \
--circ find_circ_results.txt \
--tool find_circ
輸出結(jié)果
The main output of CIRIquant is a GTF file, that contains detailed information of BSJ and FSJ reads of circRNAs and annotation of circRNA back-spliced regions in the attribute columns
CIRIquant 輸出的是一個GTF文件,其中包括circRNA的BSJ和FSJ reads 的詳細(xì)信息哑蔫,以及屬性列中對circRNA反向拼接區(qū)域的注釋钉寝。
每一列的信息
column | name | description |
---|---|---|
1 | chrom | chromosome / contig name |
2 | source | CIRIquant |
3 | type | circRNA |
4 | start | 5' back-spliced junction site |
5 | end | 3' back-spliced junction site |
6 | score | CPM of circRNAs (#BSJ / #Mapped reads) |
7 | strand | strand information |
8 | . | . |
9 | attributes | attributes seperated by semicolon |
attributes包含的key與value:
key | description |
---|---|
circ_id | name of circRNA |
circ_type | circRNA types: exon / intron / intergenic |
bsj | number of bsj reads |
fsj | number of fsj reads |
junc_ratio | circular to linear ratio: 2 * bsj / ( 2 * bsj + fsj) |
rnaser_bsj | number of bsj reads in RNase R data (only when --RNaseR is specificed) |
rnaser_fsj | number of fsj reads in RNase R data (only when --RNaseR is specificed) |
gene_id | ensemble id of host gene |
gene_name | HGNC symbol of host gene |
gene_type | type of host gene in gtf file |
2.2 RNase R效應(yīng)矯正
當(dāng)擁有RNase R處理和未經(jīng)處理的樣本時,CIRIquant 可以評估RNase R數(shù)據(jù)中處理前的circrna的表達(dá)水平闸迷。
為了去除RNase R處理效果嵌纲,需要兩個步驟:
- 對RNase R處理過的樣品運(yùn)行CIRIquant。
- 對未經(jīng)處理的總RNA樣本運(yùn)行CIRIquant腥沽,
--RNaseR
使用之前Step1中的輸出gtf文件逮走。
然后,CIRIquant將輸出RNaseR數(shù)據(jù)中鑒定到的circrna的表達(dá)水平今阳,標(biāo)題行將包含額外的RNaseR處理效率信息师溅。
Usage:
# Step1. Run CIRIquant with RNase R treated data
CIRIquant --config ./hg19.yml \
-1 ./RNaseR_treated_1.fq.gz \
-2 ./RNaseR_treated_2.fq.gz \
--no-gene \
-o ./RNaseR_treated \
-p RNaseR_treated \
-t 6
# Step2. Run CIRIquant with untreated total RNA
CIRIquant --config ./hg19.yml \
-1 ./TotalRNA_1.fq.gz \
-2 ./TotalRNA_2.fq.gz \
-o ./TotalRNA \
-p TotalRNA \
-t 6 \
--RNaseR ./RNaseR_treated/RNaseR_treated.gtf
2.3 差異表達(dá)分析
針對無生物學(xué)重復(fù)
對于沒有重復(fù)的樣品,差異表達(dá)和差異剪接分析使用 CIRI_DE
Usage:
CIRI_DE [options] -n <control> -c <case> -o <out>
<control> CIRIquant result of control sample
<case> CIRIquant result of treatment cases
<out> Output file
Options (defaults in parentheses):
-p p value threshold for DE and DS score calculation (default: 0.05)
-t numer of threads (default: 4)
Example usage:
CIRI_DE -n control.gtf -c case.gtf -o CIRI_DE.tsv
輸出結(jié)果
column | name | description |
---|---|---|
1 | circRNA_ID | circRNA identifier |
2 | Case_BSJ | number of BSJ reads in case |
3 | Case_FSJ | number of FSJ reads in case |
4 | Case_Ratio | junction ratio in case |
5 | Ctrl_BSJ | number of BSJ reads in control |
6 | Ctrl_FSJ | number of FSJ reads in control |
7 | Ctrl_Ratio | junction ratio in control |
8 | DE_score | differential expression score |
9 | DS_score | differential splicing score |
有生物學(xué)重復(fù)
對于生物學(xué)重復(fù)盾舌,推薦使用edgeR分析墓臭,使用prep_CIRIquant
去生成circRNA表達(dá)水平/junction比率的矩陣,使用CIRI_DE_replicate
用于DE分析妖谴。
Step1: 準(zhǔn)備CIRIquant 的輸出文件
需要提供一個文本文件窿锉,包含樣本信息和CIRIquant輸出的GTF文件的路徑信息
CONTROL1 ./c1/c1.gtf C 1
CONTROL2 ./c2/c2.gtf C 2
CONTROL3 ./c3/c3.gtf C 3
CASE1 ./t1/t1.gtf T 1
CASE2 ./t2/t2.gtf T 2
CASE3 ./t3/t3.gtf T 3
默認(rèn)情況下,前三列是必需的膝舅。對于paired 樣本嗡载,還可以添加一個主題名稱列。
column | description |
---|---|
1 | sample name |
2 | path to CIRIquant output gtf |
3 | group ("C" for control, "T" for treatment) |
4 | subject (optional, only for paired samples) |
注意:如果使用CIRI_DE
進(jìn)行差異表達(dá)分析仍稀,第3列g(shù)roup列必須是“C”或者“T”
然后洼滚,運(yùn)行prep_CIRIquant
匯總所以樣本的circRNA的表達(dá)量
Usage:
prep_CIRIquant [options]
-i the file of sample list
--lib where to output library information
--circ where to output circRNA annotation information
--bsj where to output the circRNA expression matrix
--ratio where to output the circRNA junction ratio matrix
Example:
prep_CIRIquant -i sample.lst \
--lib library_info.csv \
--circ circRNA_info.csv \
--bsj circRNA_bsj.csv \
--ratio circRNA_ratio.csv
輸出的結(jié)果可以使用DEseq2與edgeR進(jìn)行分析。
Step2: 準(zhǔn)備 StringTie 的輸出
StringTie的輸出應(yīng)該位于output_dir/gene/prefix_out.gtf
下琳轿。 需要使用stringTie中的prepDE.py來生成用于標(biāo)準(zhǔn)化的基因計(jì)數(shù)矩陣判沟。
例如耿芹,可以提供一個sample_gene.lst
包含樣本 ID 和 StringTie
輸出路徑的文本文件:
CONTROL1 ./c1/gene/c1_out.gtf
CONTROL2 ./c2/gene/c2_out.gtf
CONTROL3 ./c3/gene/c3_out.gtf
CASE1 ./t1/gene/t1_out.gtf
CASE2 ./t2/gene/t2_out.gtf
CASE3 ./t3/gene/t3_out.gtf
然后使用生成的gene_count_matrix.csv
去運(yùn)行prepDE.py -i sample_gene.lst
。
Step3: 差異表達(dá)分析
使用CIRI_DE_replicate
進(jìn)行差異表達(dá)分析挪哄,需要安裝edgeR包
Usage:
CIRI_DE_replicate [options]
--lib library information by CIRIquant
--bsj circRNA expression matrix
--gene gene expression matrix
--out output differential expression result
Example:
CIRI_DE_replicate --lib library_info.csv \
--bsj circRNA_bsj.csv \
--gene gene_count_matrix.csv \
--out circRNA_de.tsv
輸出結(jié)果是未過濾的吧秕,可以設(shè)定閾值對結(jié)果進(jìn)行過濾。
參考資料: