LTR_FINDER | prediction of full-length LTR retrotransposons

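I. Introduction

Long terminal repeat (LTR): a retroviral genome carries one long terminal repeat at each end (the 5' LTR and the 3' LTR). LTRs do not encode proteins, but they contain regulatory elements such as promoters and enhancers. An LTR within a viral genome can be transferred to the vicinity of a cellular proto-oncogene, where its strong promoter and enhancer activate the proto-oncogene and transform a normal cell into a cancer cell.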

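The structure is shown in the figure below.

[Figure: LTR structure]

In the figure, TSD denotes target site duplications, and the red triangles denote LTR motifs. Panel A shows an intact LTR structure; a, b, and c are the analysis targets of LTR_retriever.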

Annotation of LTR retrotransposons relies primarily on de novo approaches due to their highly diverse terminal repeats.

II. Software usage

  • Given DNA sequences, it predicts locations and structure of full-length LTR retrotransposons accurately by considering common structural features.

  • ab initio LTR retrotransposon finding.

Analysis of many LTR element sequences over nearly 20 years has revealed structural features (signals) common to these elements, including Long Terminal Repeats (LTRs), Target Site Repeats (TSRs), Primer Binding Sites (PBSs), the Polypurine Tract (PPT), and the TG ... CA box, as well as sites of Reverse Transcriptase (RT), Integrase (IN), and RNaseH (RH). These results have made ab initio computer discovery of LTR elements possible.

Step 1: use LTR_FINDER to find the LTR sequences in the genome.

~/opt/biosoft/LTR_Finder/source/ltr_finder  \
  -D 20000 -d 1000 \
  -L 700 -l 100 \
  -p 20 -C -M 0.9 Athaliana.fa >Athaliana.finder.scn
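  • -D is the maximum distance between the 5' and 3' LTRs; -d is the minimum distance between the 5' and 3' LTRs;
  • -L is the maximum length of the 5' and 3' LTR sequences; -l is the minimum length of the 5' and 3' LTR sequences;
  • -p is the minimum length of an exact-match pair;
  • -C detects centromeres and deletes highly repetitive regions (the source text writes "centriole"; centromere is presumably meant);
  • -M is the minimum LTR similarity.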

Step 2: run LTR_retriever to identify LTR-RTs from the LTR_FINDER output and build a non-redundant LTR-RT library that can be used for genome annotation.

~/opt/biosoft/LTR_retriever/LTR_retriever -threads 4 -genome Athaliana.fa -infinder Athaliana.finder.scn

Here -infinder indicates that the input comes from LTR_FINDER. This step calls RepeatMasker, which requires sequence IDs to be no longer than 50 characters.
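Because of this limit, it can help to check the sequence IDs before running LTR_retriever. The following is a minimal Python sketch, assuming the ID is the first whitespace-delimited token of each FASTA header line. The function name `long_ids` and its `max_len` parameter are hypothetical and not part of LTR_retriever or RepeatMasker.

```python
def long_ids(fasta_lines, max_len=50):
    """Return sequence IDs longer than max_len characters.

    Assumption: the ID is the text after '>' up to the first whitespace.
    """
    offenders = []
    for line in fasta_lines:
        if line.startswith(">"):
            fields = line[1:].split()
            seq_id = fields[0] if fields else ""
            if len(seq_id) > max_len:
                offenders.append(seq_id)
    return offenders


# Small in-memory example; for a real genome, pass an open file handle instead.
example = [
    ">Chr1 Arabidopsis chromosome 1",
    "ACGT",
    ">" + "scaffold_" * 7,  # a 63-character ID, over the limit
    "ACGT",
]
print(long_ids(example))
```

Any ID the function returns would need to be shortened in the genome file before Step 1, so that the LTR_FINDER output and the genome stay consistent.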

III. LTR_FINDER_parallel

Parallelizing LTR_FINDER enables rapid identification of long terminal repeat retrotransposons.

We hypothesized that complete sequences of highly complex genomes may contain a large number of complicated nested structures that exponentially increase the search space. To break down these complicated sequence structures, we split chromosomal sequences into relatively short segments (1 Mb) and execute LTR_FINDER in parallel. We expect the time complexity of LTR_FINDER_parallel to be O(n). For highly complicated regions (e.g., centromeres), one segment could take a rather long time (e.g., hours). To avoid extended operation time in such regions, we used a timeout scheme (300 s) to limit how long a child process can run. If a segment times out, the 1 Mb segment is further split into 50 kb segments to salvage LTR candidates. After all segments are processed, the regional coordinates of LTR candidates are converted back to genome-level coordinates for the convenience of downstream analyses.
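The splitting, rescue, and coordinate conversion described above can be sketched in Python. This is only an illustration under stated assumptions, not the tool's actual Perl implementation: the function names are hypothetical, windows are 0-based and half-open, candidate coordinates are 1-based, and the `timed_out` set stands in for the real 300 s timeout check, which is not modeled here.

```python
def split_segments(seq_len, size=1_000_000):
    """Return non-overlapping, 0-based half-open (start, end) windows covering a sequence."""
    return [(start, min(start + size, seq_len)) for start in range(0, seq_len, size)]


def to_genome_coords(segment_start, local_start, local_end):
    """Convert 1-based coordinates within a segment to 1-based chromosome coordinates."""
    return segment_start + local_start, segment_start + local_end


def plan_segments(seq_len, timed_out, size=1_000_000, rescue_size=50_000):
    """Replace every window listed in timed_out with smaller rescue windows.

    Assumption: timed_out is a set of (start, end) windows that exceeded the time limit.
    """
    plan = []
    for start, end in split_segments(seq_len, size):
        if (start, end) in timed_out:
            # Re-split the slow window into rescue_size pieces, keeping chromosome offsets.
            plan.extend(
                (start + s, start + e) for s, e in split_segments(end - start, rescue_size)
            )
        else:
            plan.append((start, end))
    return plan


# A 2.5 Mb chromosome whose second 1 Mb window is assumed to have timed out.
plan = plan_segments(2_500_000, timed_out={(1_000_000, 2_000_000)})
print(len(plan), plan[0], plan[1], plan[-1])

# A candidate at positions 120..5300 inside the window that starts at offset 1,000,000.
print(to_genome_coords(1_000_000, 120, 5300))
```

Because every window has a fixed size and a capped run time, the total work grows roughly in proportion to the number of windows, which is the intuition behind the expected O(n) behavior.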

Usage: perl LTR_FINDER_parallel -seq [file] -size [int] -threads [int]  
Options:
    -seq    [file]  Specify the sequence file.
    -size   [int]   Specify the size you want to split the genome sequence.
            Please make it large enough to avoid splitting too many LTR elements. Default 5000000 (bp).
    -time   [int]   Specify the maximum time to run a subregion (a thread).
            This helps to skip simple repeat regions that take a substantial amount of time to run. Default: 1500 (seconds).
            Suggestion: 300 for -size 1000000. Increase -time when -size increased.  
    -try1   [0|1]   If a region requires more time than the specified -time (timeout), decide:  
                0, discard the entire region.
                1, further split to 50 Kb regions to salvage LTR candidates (default);
    -harvest_out    Output LTRharvest format if specified. Default: output LTR_FINDER table format.
    -next           Only summarize the results for previous jobs without rerunning LTR_FINDER (for -v).
    -verbose|-v     Retain LTR_FINDER outputs for each sequence piece.
    -finder [file]  The path to the program LTR_FINDER (default v1.0.7, included in this package).
    -threads|-t     [int]   Indicate how many CPU/threads you want to run LTR_FINDER.
    -check_dependencies Check if dependencies are fulfilled and quit
    -help|-h        Display this help information.

1. Input

Genome file in multi-FASTA format.

2. Output

GFF3, LTRharvest (STDOUT), or LTR_FINDER (-w 2) formats of predicted LTR candidates.

3. Parameter setting for LTR_FINDER

Currently there are no parameter settings for LTR_FINDER in this parallel version. I have chosen the "best" parameters for you. Please refer to LTR_FINDER for details of these parameters.

-w 2 -C -D 15000 -d 1000 -L 7000 -l 100 -p 20 -M 0.85

If you want to use other parameters in LTR_FINDER_parallel, please edit the file LTR_FINDER_parallel line 9 to change the preset parameters.

Based on our previous study [1], we applied the optimized parameters for LTR_FINDER (-w 2 -C -D 15000 -d 1000 -L 7000 -l 100 -p 20 -M 0.85), which identify long terminal repeats ranging from 100 to 7000 bp with identity ≥85% and interval regions from 1 to 15 kb. The output of LTR_FINDER_parallel is convertible to the popular LTRharvest format, which is compatible with the high-accuracy post-processing filter LTR_retriever.

4. Performance benchmark

Genome                              Arabidopsis   Rice         Maize        Wheat
Version                             TAIR10        MSU7         AGPv4        CS1.0
Size                                119.7 Mb      374.5 Mb     2134.4 Mb    14547.3 Mb
Original memory (1 CPU*)            0.37 Gbyte    0.55 Gbyte   5.00 Gbyte   11.88 Gbyte
Parallel memory (36 CPUs*)          0.10 Gbyte    0.12 Gbyte   0.82 Gbyte   17.67 Gbyte
Original time (1 CPU)               0.58 h        2.1 h        448.5 h      10169.3 h
Parallel time (36 CPUs)             6.4 min       2.6 min      10.3 min     71.8 min
Speed up                            5.4x          48.5x        2,613x       8,498x
Number of LTR candidates (1 CPU)    226           2,851        60,165       231,043
Number of LTR candidates (36 CPUs)  226           2,834        59,658       237,352
% difference of candidate #         0.00%         0.60%        0.84%        -2.73%

*Intel(R) Xeon(R) CPU E5-2660 v4 @ 2.00GHz
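The "Speed up" and "% difference" rows follow from the other rows of the benchmark. The short Python sketch below recomputes them from the reported times and candidate counts. The formulas are an assumption inferred from the numbers (original time divided by parallel time, and candidates lost relative to the 1-CPU run); the function names are hypothetical.

```python
# Wall-clock times from the benchmark: (original hours on 1 CPU, parallel minutes on 36 CPUs).
times = {
    "Arabidopsis": (0.58, 6.4),
    "Rice": (2.1, 2.6),
    "Maize": (448.5, 10.3),
    "Wheat": (10169.3, 71.8),
}
# Candidate counts from the same benchmark: (1 CPU, 36 CPUs).
candidates = {
    "Arabidopsis": (226, 226),
    "Rice": (2851, 2834),
    "Maize": (60165, 59658),
    "Wheat": (231043, 237352),
}


def speed_up(original_hours, parallel_minutes):
    """Ratio of original to parallel run time, with both expressed in minutes."""
    return original_hours * 60 / parallel_minutes


def percent_difference(original, parallel):
    """Candidates lost in the parallel run, as a percentage of the original count."""
    return (original - parallel) / original * 100


for genome in times:
    s = speed_up(*times[genome])
    d = percent_difference(*candidates[genome])
    print(f"{genome}: {s:.1f}x speed up, {d:.2f}% difference")
```

A negative difference, as for wheat, means the parallel run reported more candidates than the single-CPU run.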

5. FAQs and best practices

(1) How to generate output files for LTR_retriever?
A: You can use the -harvest_out parameter to generate LTRharvest-format output, then feed it to LTR_retriever using -inharvest. If you have more than one LTRharvest output, simply cat them together.

(2) How to prepare the genome file?
A: It's highly recommended to use short and simple sequence names. For example, use letters, numbers, and _ to generate unique names shorter than 15 characters. This will make your downstream analyses much easier. If you have complicated sequence names and encounter errors, you may want to simplify them and try again.
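One way to follow this advice is to rewrite the FASTA headers before running any analysis. The Python sketch below is a hypothetical helper, not part of LTR_FINDER_parallel. It assumes the sequence name is the first whitespace-delimited token of the header, keeps only letters, digits, and _, caps names at 14 characters by default (that is, shorter than 15), and returns a mapping so the original names can be restored later.

```python
import re


def simplify_names(fasta_lines, max_len=14):
    """Rewrite FASTA headers as short unique IDs made of letters, digits, and '_'.

    Returns (new_lines, mapping); mapping maps each new ID to the original header text.
    """
    new_lines, mapping = [], {}
    for line in fasta_lines:
        line = line.rstrip("\n")
        if not line.startswith(">"):
            new_lines.append(line)
            continue
        fields = line[1:].split()
        first = fields[0] if fields else "seq"
        # Replace every other character with '_' and trim to max_len.
        base = re.sub(r"[^A-Za-z0-9_]", "_", first)[:max_len]
        new_id, n = base, 1
        while new_id in mapping:
            # Resolve collisions with a numeric suffix while staying within max_len.
            suffix = f"_{n}"
            new_id = base[: max_len - len(suffix)] + suffix
            n += 1
        mapping[new_id] = line[1:]
        new_lines.append(">" + new_id)
    return new_lines, mapping


example = [
    ">Chr1|gi:12345 Arabidopsis thaliana chromosome 1",
    "ACGT",
    ">averyveryverylongname_A",
    "TTGA",
    ">averyveryverylongname_B",
    "GGCC",
]
new_lines, mapping = simplify_names(example)
print("\n".join(new_lines))
print(mapping)
```

Keeping the mapping alongside the renamed genome makes it possible to translate results back to the original names after the downstream analyses.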

(3) Do I really need to modify the -size, -time, and -try1 parameters?
A: Not really. These parameters are already optimized for the best general performance; change them only if you are 100% sure of what you are doing.

6. Issues

Currently I am using a non-overlapping way to cut the original sequence, so some LTR elements could be broken at segment boundaries. So far the side effect is minimal (< 1% loss) compared with the performance boost (up to 8,500x faster). I don't have a plan to update it to a sliding-window scheme. You are welcome to improve it and open a merge request.

References:
https://github.com/xzhub/ltr_finder
https://gitee.com/xdkong/LTR_FINDER_parallel
https://www.cnblogs.com/bio-mary/p/12187157.html

Ou S, Jiang N. LTR_FINDER_parallel: parallelization of LTR_FINDER enabling rapid identification of long terminal repeat retrotransposons. Mob DNA 2019;10(1):48.
