EVM 對預(yù)測結(jié)果進(jìn)行整合

從頭預(yù)測，同源注釋和轉(zhuǎn)錄組整合都會得到一個預(yù)測結(jié)果娃胆，EVidenceModeler(EVM) 可以對上述結(jié)果進(jìn)整合

軟件安裝

wget -4 https://github.com/EVidenceModeler/EVidenceModeler/archive/v1.1.1.tar.gz
tar xf v1.1.1.tar.gz
# 添加環(huán)境變量

使用流程

所需數(shù)據(jù)

gene_prediction.gff3
標(biāo)準(zhǔn)的gff3格式冈钦，必須要有g(shù)ene, mRNA, exon, CDS這些特征
protein_alignments.gff3: 標(biāo)準(zhǔn)的GFF3格式篡悟，第9列要有ID信和和target信息, 標(biāo)明是比對結(jié)果
transcript_alignments.gff3:標(biāo)準(zhǔn)的GFF3格式致板，第9列要有ID信和和target信息，標(biāo)明是比對結(jié)果

$\color{red}{若有多個同類型注釋結(jié)果坊夫，可以cat在一起}$

運行EVM

創(chuàng)建權(quán)重文件

# copy /EVidenceModeler-1.1.1/simple_example 下的weights.txt進(jìn)行修改
cp ~ /EVidenceModeler-1.1.1/simple_example/weights.txt ./
vi weights.txt
ABINITIO_PREDICTION      augustus       4
TRANSCRIPT      assembler-database.sqlite      7
OTHER_PREDICTION  transdecoder  8
## 第一列為來源類型砖第；分為：ABINITIO_PREDICTION, PROTEIN, TRANSCRIPT
## 第二列對應(yīng)著gff3文件第二列
## 第三列為權(quán)重

分割原始數(shù)據(jù), 用于后續(xù)并行

/EVidenceModeler-1.1.1/EvmUtils/partition_EVM_inputs.pl \
           --genome ref.fa \
           --gene_predictions gene_predictions.gff3 \
           --transcript_alignments transcript_alignments.gff3 \
           --segmentSize 100000 --overlapSize 10000 \
           --partition_listing partitions_list.out
# 參數(shù)
--genome： fasta file containing all genome sequences
--gene_predictions：* file containing gene predictions
--protein_alignments: file containing protein alignments
--transcript_alignments：file containing transcript alignments
--segmentSize：* :length of a single sequence for running EVM
--overlapSize： * :length of sequence overlap between segmented sequences
--partition_listing     * :name of output file to be created that contains the list of partitions

--segmentsSize設(shè)置的大小需要少于1Mb(這里是100k)岸售， --overlapSize的不能太小，如果數(shù)學(xué)好厂画，可用設(shè)置成基因平均長度加上2個標(biāo)準(zhǔn)差，數(shù)學(xué)不好拷邢，就設(shè)置成10K吧

創(chuàng)建并行運算命令并且執(zhí)行

~/opt/biosoft/EVidenceModeler-1.1.1/EvmUtils/write_EVM_commands.pl --genome ref.fa --weights `pwd`/weights.txt \
      --gene_predictions gene_predictions.gff3 \
      --transcript_alignments transcript_alignments.gff3 \
      --output_file_name evm.out  --partitions partitions_list.out >  commands.list
~/opt/biosoft/EVidenceModeler-1.1.1/EvmUtils/execute_EVM_commands.pl commands.list


#參數(shù)
 --weights          | -w    weights for evidence types file

合并運行結(jié)果

~/opt/biosoft/EVidenceModeler-1.1.1/EvmUtils/recombine_EVM_partial_outputs.pl --partitions partitions_list.out --output_file_name evm.out

結(jié)果轉(zhuǎn)換成GFF3

~/opt/biosoft/EVidenceModeler-1.1.1/EvmUtils/convert_EVM_outputs_to_GFF3.pl  --partitions partitions_list.out --output evm.out  --genome ref.fa
find . -regex ".*evm.out.gff3" -exec cat {} \; | bedtools sort -i - > EVM.all.gff

過濾gff文件

注釋過濾：對于初步預(yù)測得到的基因袱院，還可以稍微優(yōu)化一下，例如剔除編碼少于50個AA的預(yù)測結(jié)果瞭稼，將轉(zhuǎn)座子單獨放到一個文件中(軟件有TransposonPSI)忽洛。

這里基于gffread先根據(jù)注釋信息提取所有的CDS序列，過濾出長度不足50AA的序列环肘，基于這些序列過濾原來的的注釋

gffread EVM.all.gff -g input/genome.fa -y tr_cds.fa
bioawk -c fastx '$seq < 50 {print $comment}' tr_cds.fa | cut -d '=' -f 2 > short_aa_gene_list.txt
grep -v -w -f short_aa_gene_list.txt EvM.all.gff > filter.gff

參考

如何對基因組序列進(jìn)行注釋

?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請聯(lián)系作者

EVM 對預(yù)測結(jié)果進(jìn)行整合

軟件安裝

使用流程

所需數(shù)據(jù)

運行EVM

創(chuàng)建權(quán)重文件

分割原始數(shù)據(jù), 用于后續(xù)并行

創(chuàng)建并行運算命令并且執(zhí)行

合并運行結(jié)果

結(jié)果轉(zhuǎn)換成GFF3

過濾gff文件

參考