本文方法是根據(jù)"Ageing LTR insertions"(https://github.com/SIWLab/Lab_Info/wiki/Ageing-LTR-insertions)這篇文章提及的方法進行的. 但該文很多部分都是簡略介紹,提供的script也已經(jīng)失效,因而只能自己重新寫腳本.指導思想看該文,本文只補充實操部分.
用LTRharvest尋找LTR后,獲得注釋文件EGL.gff
其中有l(wèi)ong_terminal_repeat的行為LTR所在注釋行, 用此行信息提取出LTR對,每對生成一個文件
LTRharvest的結(jié)果又一個坑爹的地方,它會不管原來的序列名稱是什么,統(tǒng)一替換為seq0, seq1, seq2......等
提取前,首先要將gff的seq還原為原來的染色體名字,將前面有#的注釋行刪除. 在第一行加上"###"號,以供awk識別.
mkdir ltr
cd ltr
cp EGL.gff.rename
ulimit -n 100000 #解除文件數(shù)限制,否則報錯
awk '/###/{filename=NR".txt"}; /long_terminal_repeat/{gsub(/Parent=/,"");print $1,$4,$5,$9 >filename}' OFS="\t" EGL.gff.rename #生成每對ltr的bed
ls | grep txt | sort -n > file.list #生成bed的列表
cp file.list ..
cd ..
目標是用bedtools提取每對LTR,再用muscle比對,這里寫python生成bash:
filelistname = 'F:/sequence/EGL_hub/file.list'
fileresult = filelistname+".sh"
f = open(fileresult , "w")
with open(filelistname,'r') as r:
lines=r.read().splitlines()
i = 0
for l in lines:
i= i+1
f.write("bedtools getfasta -fi ../ensete_glaucum.assembly.fna -bed " + l + " -fo retroltr" + str(i) +".fa -name"+ '\n')
f.write("muscle -in retroltr" + str(i) +".fa -out retroltr" + str(i) +".fa.afa" + '\n')
將所有比對結(jié)果(.afa結(jié)尾)放到一個新的文件夾.這里使用R包ape進行計算:
mkdir ../EGL.ltr
cp *afa ../EGL.ltr
R包代碼:
library(ape)
library(xlsx)
setwd('F:/sequence/ltr_time/ltr.alignment') ###The path should be changed
list <- list.files()
fas.F1 = read.FASTA(list[1])
mat1 = dist.dna(fas.F1,as.matrix = T, model = "K80")
merge.data = mat1[1,2]
list <- list.files()
mutate_rate <- 1.3e-8 #根據(jù)水稻的突變率,(Ma, 2004),可以根據(jù)物種改變
time1 = merge.data/(2*mutate_rate)
v1.merge = c(list[1],merge.data, time)
dir = paste("./",list,sep="")
n = length(dir)
for (i in 2:n){
fas.F.new = read.FASTA(list[i])
mat.new = dist.dna(fas.F.new, as.matrix = T, model = "K80")
time = mat.new[1,2]/(2*mutate_rate)
v.new = c(list[i], mat.new[1,2], time)
v1.merge = rbind(v1.merge, v.new)
}
write.xlsx(v1.merge,file = 'F:/sequence/ltr_time/EGL.xls')
更詳細的描述可見:https://github.com/wangziwei08/LTR-insertion-time-estimation
如果使用了這個插入時間估算方法,希望能引用以下文章:
A chromosome-level reference genome of Ensete glaucum gives insight into diversity and chromosomal and repetitive sequence evolution in the Musaceae