Defining unique sequence abundances (reposted)

Source: https://www.drive5.com/usearch/manual/global_trimming_and_abundance.html


Defining abundance when sequence length varies

Calculating unique sequence abundance is problematic when reads of the same template sequence vary in length, e.g. because reads are truncated when the quality score drops below a threshold.

Consider two reads A and B where B is shorter but otherwise identical to A. Here, abundance could be defined in three different ways.

(1) There are two unique sequences A and B, each with abundance one.

(2) There is one unique sequence A with abundance two.

(3) There is one unique sequence B with abundance two.
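The three definitions can be compared side by side with a minimal Python sketch. The function names are hypothetical and are not part of USEARCH; the sketch assumes that reads of the same template differ only by truncation, so one read is always a prefix of the other.

```python
from collections import Counter

def abundance_exact(reads):
    """Definition (1): every distinct string is its own unique sequence."""
    return dict(Counter(reads))

def abundance_longest(reads):
    """Definition (2): each read counts toward the longest read it is a prefix of."""
    uniques = sorted(set(reads))  # sorted for deterministic tie-breaking
    counts = Counter()
    for r in reads:
        target = max((s for s in uniques if s.startswith(r)), key=len)
        counts[target] += 1
    return dict(counts)

def abundance_shortest(reads):
    """Definition (3): each read counts toward the shortest read that is a prefix of it."""
    uniques = sorted(set(reads))
    counts = Counter()
    for r in reads:
        target = min((s for s in uniques if r.startswith(s)), key=len)
        counts[target] += 1
    return dict(counts)

# Read A and a shorter, otherwise identical read B (made-up sequences).
reads = ["ACGTACGT", "ACGTAC"]
print(abundance_exact(reads))     # A and B, abundance one each
print(abundance_longest(reads))   # A only, abundance two
print(abundance_shortest(reads))  # B only, abundance two
```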

All of these definitions have problems.

With (1), a given template sequence with high abundance in the amplicons will typically have many different unique sequences with low abundances because its reads are truncated to many different lengths.

With (2) the unmatched tail of A is considered to have the same abundance as the prefix of A that is identical to B. The tail has no support from other reads (it is effectively a singleton), but that information is lost and in practice long reads with noisy tails are assigned high abundances.

With (3), the shortest sequence in a set is supported by longer sequences. This is the least bad definition: if the abundance is high, the sequence is likely to be correct. However, phylogenetically and phenotypically informative bases may be lost, and the ambiguities inherent in comparing sequences of different length must now be addressed by downstream algorithms (e.g., denoising or OTU clustering). For example, if two unique sequences differ in length by one base and have one substitution, should this count as d=1 (just the substitution) or d=2 (substitution plus terminal gap)? If large variations in length are allowed, then the phylogenetic and phenotypic resolution of the sequences may vary substantially, degrading the comparability of ZOTUs or OTUs to each other for calculating diversity, predicting taxonomy and so on. These problems are avoided by ensuring that reads of the same template sequence have the same length (global trimming, implying that reads of the same template should be globally alignable, though more distantly related sequences need not be).
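The d=1 versus d=2 ambiguity can be made concrete with a small sketch. The function below is hypothetical and assumes the two sequences start at the same position and have no internal indels, so the only length difference is a terminal gap.

```python
def prefix_distance(a, b, count_terminal_gap):
    """Count substitutions over the shared prefix, optionally adding the terminal gap length."""
    # zip stops at the end of the shorter sequence, so only overlapping bases are compared.
    substitutions = sum(1 for x, y in zip(a, b) if x != y)
    gap = abs(len(a) - len(b)) if count_terminal_gap else 0
    return substitutions + gap

# Made-up sequences: b is one base shorter than a and has one substitution.
a = "ACGTACGT"
b = "ACGAACG"
d_without_gap = prefix_distance(a, b, count_terminal_gap=False)
d_with_gap = prefix_distance(a, b, count_terminal_gap=True)
print(d_without_gap, d_with_gap)
```

Both answers are defensible, which is why a downstream algorithm has to pick a convention when lengths vary.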

Methods for global trimming

The simplest method for global trimming is to truncate all reads to the same length. This is not usually necessary with overlapping Illumina paired-end reads that have been merged by a paired-read assembler. In this case, the merged sequence always terminates at the reverse primer which guarantees that reads of the same template will have the same length regardless of variations in amplicon length between different species. If multiple primers were used which do not bind to the same locus, then trimming is required to ensure that reads of the same template amplified by different primers start and end at the same position in the biological sequence. Primer-binding bases should be discarded from the reads because PCR tends to induce substitutions at mismatched positions; in most cases this is easily accomplished by discarding a fixed number of bases (the primer lengths) from each end of the sequence. There is no need to explicitly match the primer sequence in order to trim it unless there are multiple primers binding to different loci.
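Discarding a fixed number of bases from each end can be sketched in a few lines of Python. The function name and the primer lengths used here are illustrative assumptions, and the sketch assumes a single primer pair binding to one locus, so no primer matching is needed.

```python
def strip_primers(seq, fwd_primer_len, rev_primer_len):
    """Drop a fixed number of primer-binding bases from each end of a sequence."""
    if len(seq) <= fwd_primer_len + rev_primer_len:
        return ""  # nothing left once both primers are removed
    return seq[fwd_primer_len:len(seq) - rev_primer_len]

# Made-up merged read with 3-base "primers" at each end.
merged = "AAACCCGGGTTT"
print(strip_primers(merged, 3, 3))
```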


Global trimming

The goal of global trimming is to ensure that reads from the same template have the same length. More accurately, it should ensure that sequences start and end at exactly the same position. If you don't do this, then two reads of the same biological sequence may have different lengths, and this causes problems in calculating the abundances of unique sequences.

Another way to state the goal of global trimming is that there should be no terminal gaps in an alignment of reads of the same template.

See the discussion of defining unique sequence abundance above for a technical explanation of why this step is essential.

The appropriate strategy for global trimming depends on your reads. See also global trimming for fungal ITS reads.

Paired reads which always overlap

If the read length is long enough that the longest amplicon will give an overlap of at least, say, 32 bases, then you don't need any additional trimming: fastq_mergepairs does everything you need. Short amplicons will create "staggered" pairs which are correctly truncated during the merging.

Paired reads which sometimes or never overlap

If the read length is not long enough to get overlaps on longer amplicons, then you can't use the reverse reads. The best strategy is simply to discard the reverse reads (R2s) and make OTUs from the forward (R1) reads alone. See below under "Unpaired reads" for the appropriate trimming strategy.

Unpaired reads which never reach the reverse primer

If you have unpaired reads which never reach the reverse primer then they should be trimmed to a fixed length. If the reads are already fixed length (e.g. forward Illumina reads), then no trimming is necessary. You might choose to trim to a shorter length if the read quality is poor towards the end of the read (see fastq_eestats2 and fastx_truncate).
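Fixed-length trimming amounts to the following Python sketch. The function is hypothetical, and discarding reads shorter than the target length is an assumption made here so that every surviving read has exactly the same length.

```python
def truncate_fixed(reads, length):
    """Truncate each read to `length` bases; drop reads that are too short (assumption)."""
    return [r[:length] for r in reads if len(r) >= length]

# Made-up reads of one template, truncated at different positions.
reads = ["ACGTACGT", "ACGTAC", "ACG"]
trimmed = truncate_fixed(reads, 6)
print(trimmed)
```

After trimming, the two surviving reads are identical, so they collapse to one unique sequence with abundance two.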

Unpaired reads which sometimes or always reach the reverse primer

If a read continues past the reverse primer, then it will include adapter sequence and then random junk. The adapter and junk must be discarded. It is probably also a good idea to delete the primer sequence since PCR tends to force the primer-binding locus to match the primer. Unfortunately, there is currently no easy way to do this in USEARCH. You can use search_oligodb to find the reverse primer, but you will need to write your own script to truncate the reads. If this is a real problem for you, let me know and I'll look into making a new command for you.
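A truncation script of the kind described might look like the Python sketch below. It is not a USEARCH feature: the function names are hypothetical, and it assumes an exact, non-degenerate primer match, whereas real data would need tolerance for mismatches (for example by taking the primer position from search_oligodb output instead of a string search). In a forward read, the reverse primer appears as its reverse complement.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of an unambiguous DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]

def truncate_at_reverse_primer(read, rev_primer):
    """Cut the read just before the reverse primer site, dropping primer, adapter and junk.

    Returns None if the primer site is not found, leaving the caller to
    decide how to handle reads that never reach the primer.
    """
    site = reverse_complement(rev_primer)
    pos = read.find(site)  # exact match only (assumption)
    if pos == -1:
        return None
    return read[:pos]

# Made-up example: biological sequence + primer site + adapter-like tail.
rev_primer = "GGACT"
read = "ACGTACGT" + reverse_complement(rev_primer) + "TTTTGGGG"
print(truncate_at_reverse_primer(read, rev_primer))
```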
