從頭預(yù)測,同源注釋和轉(zhuǎn)錄組整合都會得到一個預(yù)測結(jié)果娃胆,EVidenceModeler(EVM) 可以對上述結(jié)果進(jìn)整合
軟件安裝
wget -4 https://github.com/EVidenceModeler/EVidenceModeler/archive/v1.1.1.tar.gz
tar xf v1.1.1.tar.gz
# 添加環(huán)境變量
使用流程
-
所需數(shù)據(jù)
- gene_prediction.gff3
標(biāo)準(zhǔn)的gff3格式冈钦,必須要有g(shù)ene, mRNA, exon, CDS這些特征 - protein_alignments.gff3: 標(biāo)準(zhǔn)的GFF3格式篡悟,第9列要有ID信和和target信息, 標(biāo)明是比對結(jié)果
- transcript_alignments.gff3:標(biāo)準(zhǔn)的GFF3格式致板,第9列要有ID信和和target信息,標(biāo)明是比對結(jié)果
-
運行EVM
-
創(chuàng)建權(quán)重文件
# copy /EVidenceModeler-1.1.1/simple_example 下的weights.txt進(jìn)行修改
cp ~ /EVidenceModeler-1.1.1/simple_example/weights.txt ./
vi weights.txt
ABINITIO_PREDICTION augustus 4
TRANSCRIPT assembler-database.sqlite 7
OTHER_PREDICTION transdecoder 8
## 第一列為來源類型砖第;分為:ABINITIO_PREDICTION, PROTEIN, TRANSCRIPT
## 第二列對應(yīng)著gff3文件第二列
## 第三列為權(quán)重
-
分割原始數(shù)據(jù), 用于后續(xù)并行
/EVidenceModeler-1.1.1/EvmUtils/partition_EVM_inputs.pl \
--genome ref.fa \
--gene_predictions gene_predictions.gff3 \
--transcript_alignments transcript_alignments.gff3 \
--segmentSize 100000 --overlapSize 10000 \
--partition_listing partitions_list.out
# 參數(shù)
--genome: fasta file containing all genome sequences
--gene_predictions:* file containing gene predictions
--protein_alignments: file containing protein alignments
--transcript_alignments:file containing transcript alignments
--segmentSize:* :length of a single sequence for running EVM
--overlapSize: * :length of sequence overlap between segmented sequences
--partition_listing * :name of output file to be created that contains the list of partitions
--segmentsSize設(shè)置的大小需要少于1Mb(這里是100k)岸售, --overlapSize的不能太小,如果數(shù)學(xué)好厂画,可用設(shè)置成基因平均長度加上2個標(biāo)準(zhǔn)差,數(shù)學(xué)不好拷邢,就設(shè)置成10K吧
-
創(chuàng)建并行運算命令并且執(zhí)行
~/opt/biosoft/EVidenceModeler-1.1.1/EvmUtils/write_EVM_commands.pl --genome ref.fa --weights `pwd`/weights.txt \
--gene_predictions gene_predictions.gff3 \
--transcript_alignments transcript_alignments.gff3 \
--output_file_name evm.out --partitions partitions_list.out > commands.list
~/opt/biosoft/EVidenceModeler-1.1.1/EvmUtils/execute_EVM_commands.pl commands.list
#參數(shù)
--weights | -w weights for evidence types file
-
合并運行結(jié)果
~/opt/biosoft/EVidenceModeler-1.1.1/EvmUtils/recombine_EVM_partial_outputs.pl --partitions partitions_list.out --output_file_name evm.out
-
結(jié)果轉(zhuǎn)換成GFF3
~/opt/biosoft/EVidenceModeler-1.1.1/EvmUtils/convert_EVM_outputs_to_GFF3.pl --partitions partitions_list.out --output evm.out --genome ref.fa
find . -regex ".*evm.out.gff3" -exec cat {} \; | bedtools sort -i - > EVM.all.gff
過濾gff文件
注釋過濾:對于初步預(yù)測得到的基因袱院,還可以稍微優(yōu)化一下,例如剔除編碼少于50個AA的預(yù)測結(jié)果瞭稼,將轉(zhuǎn)座子單獨放到一個文件中(軟件有TransposonPSI)忽洛。
這里基于gffread先根據(jù)注釋信息提取所有的CDS序列,過濾出長度不足50AA的序列环肘,基于這些序列過濾原來的的注釋
gffread EVM.all.gff -g input/genome.fa -y tr_cds.fa
bioawk -c fastx '$seq < 50 {print $comment}' tr_cds.fa | cut -d '=' -f 2 > short_aa_gene_list.txt
grep -v -w -f short_aa_gene_list.txt EvM.all.gff > filter.gff