解決 Stringtie 基因重復(fù)定量的意外收獲

說明：因為平臺限制和平臺廣告等原因，今后的文章將不在簡書更新，請移步并訂閱個人博客

說明：因為平臺限制和平臺廣告等原因乖寒，今后的文章將不在簡書更新，請移步并訂閱個人博客

說明：因為平臺限制和平臺廣告等原因院溺，今后的文章將不在簡書更新楣嘁，請移步并訂閱個人博客

Bioinformatics was like a box of chocolates. You never konw what you're gonna get ≈ 咱也不知道，咱也不敢問珍逸。

鋪墊

由于自己之前一直不喜歡用 cufflinks逐虚，所以后面的 stringtie 也是能不用就不用，偶爾用下也都是淺嘗輒止谆膳。因為 stringtie 可以直接拿到基因的 TPM 值痊班，比 RSEM 需要單獨構(gòu)建一次索引的操作省力些，所以最近自己注釋了些基因就用它對十幾個樣本跑了一波基因定量的常規(guī)操作摹量。心想做個表達(dá)矩陣進(jìn)行下游分析，結(jié)果偶買噶馒胆，每個定量結(jié)果的行數(shù)（基因數(shù)）竟然都不一樣缨称。同一個注釋文件定量出了不同的結(jié)果，檢查一下原始基因數(shù)祝迂，發(fā)現(xiàn)有的定量結(jié)果行數(shù)要比實際基因數(shù)多7個睦尽，有的要多5個，還不盡相同型雳。

嘗試

遇事不怕事当凡，先看看多出來了什么基因，我發(fā)現(xiàn)有八九個基因竟然會重復(fù)出現(xiàn)在定量結(jié)果中纠俭⊙亓浚回到注釋文件里看這些基因為什么作妖，隨便看了3個發(fā)現(xiàn)一個規(guī)律冤荆。這些基因?qū)?yīng)的轉(zhuǎn)錄本 ID 并不是連續(xù)的（因為上游分析時我過濾掉了一些不符合自己要求的轉(zhuǎn)錄本）朴则，如下截圖所示，gff 刪掉了2和3钓简，結(jié)果在定量時出現(xiàn)了兩個相同的ID 乌妒。

如此這般汹想，難道是 stringtie 不能識別不連續(xù) ID 轉(zhuǎn)錄本的基因？暗自感覺這個bug無語的同時動手把那幾個命名有 gap 的轉(zhuǎn)錄本 ID重命名為連續(xù)撤蚊，又測試了一次還是重復(fù)出現(xiàn)了這個問題古掏。看來并不是 ID 不連續(xù)造成的侦啸，肯定還有一些背后的原因槽唾。

再一次仔細(xì)觀察這些不連續(xù)的轉(zhuǎn)錄本信息，CS.104750 在最后的表達(dá)定量中出現(xiàn)了兩次匹中，且一個基因給出的位置坐標(biāo)是在轉(zhuǎn)錄本1和4上（18509027-18513961）夏漱，另一個基因位置是在4-8上（18574381-18579343）。如果是通過 ID 來拆分轉(zhuǎn)錄本應(yīng)該是轉(zhuǎn)錄本1獨立為一個基因顶捷，其它轉(zhuǎn)錄本獨立為一個基因挂绰。這里似乎是按照位置進(jìn)行了區(qū)分，因為轉(zhuǎn)錄本1服赎、4和另外5-8之間沒有位置上的重疊葵蒂，所有被分為了兩個基因。在檢查其它幾個重復(fù)出現(xiàn)的基因重虑，都符合我這個猜想践付。

好了，現(xiàn)在另一個問題是缺厉，如果是沒有重疊區(qū)域的轉(zhuǎn)錄本會被按照獨立的基因單獨統(tǒng)計永高，那為什么不同的樣本重復(fù)的基因數(shù)量并不一樣呢？到底 stringtie 是怎么一個定量邏輯提针，到這里靠猜就解決不了問題了命爬，只能去找原始文獻(xiàn)和源代碼。

驚喜

歷史的經(jīng)驗告訴我自己不太可能是第一個遇到這問題的人辐脖，于是問了一些朋友但也是答非所問饲宛。不如干脆去 GitHub 提個 issue，習(xí)慣性在提之前嘗試搜了一下 stringtie gene abundance duplicate 嗜价。用這幾個關(guān)鍵詞艇抠，Google 第二個結(jié)果就是一個和我完全一樣的困惑，而且也是在 GitHub 上提出的問題久锥。

我點進(jìn)去發(fā)現(xiàn)內(nèi)容還挺長家淤，開發(fā)者和提問人有好幾個回合的問答。讀完不禁長處一口氣瑟由，這個 issue 不光解答了我的困惑媒鼓，還提供了一個如何提問的絕佳范本。非常值得我們?nèi)W(xué)習(xí)和反思。

以下是主要內(nèi)容的摘錄和簡要評論：

I am currently running stringtie on Arabidopsis rna-seq data sets and I am encountering duplication events that vary from sample to sample. The command I am running is:

stringtie -v -p 1 -e -o test.gtf -G TAIR10_Araport11.gtf -A test.ga -l test OutputFromHisat-Samtools_sort-Samtoools_Index.bam

I am using the arabidopsis gtf downloaded from TAIR, and the reads I am running are downloaded from NCBI. My workflow is trimmomatic, hisat2, samtools_sort, samtools_index, stringtie. I am running stringtie v1.3.4d. My workflow was run on a llinux slurm cluster, and I have also verified the same results on my local machine (Ubuntu 18.04).

I typically have 1 or 2 genes per run that have duplication events.

A colleague running the same workflow on bovine data is having the same issue.

提問人在問題的描述中首先給出了自己處理的數(shù)據(jù)是什么物種绿鸣，然后附上了自己處理的完整命令疚沐，另外也說了自己的參考基因組下載自哪里。同時寫明了自己的分析流程和 stringtie 對應(yīng)的軟件版本潮模。作者也強調(diào)自己在服務(wù)器集群和本地都是一個結(jié)果亮蛔，以及自己的朋友也重復(fù)了這個問題——總有一兩個基因在定量的時候會被重復(fù)呈現(xiàn)。

在開發(fā)者的第一次回復(fù)中提到了是不是注釋文件做過某些處理擎厢，本身存在重復(fù)的gene id究流，另外希望能提供一點有問題的bam文件，另外作者還非常貼心的提供了如何提取一些bam結(jié)果的方法动遭。這里就不引用原始內(nèi)容了芬探。以下是提問者的第二次回復(fù)。

Sorry, the name of the gtf was confusing, it makes more sense in the context of our workflow. Yes, the gtf was produced from the file on the TAIR page named

Araport11_GFF3_genes_transposons.201606.gff.gz

using the command from Cufflinks

gffread -E Araport11_GFF3_genes_transposons.201606.gff -T -o- > TAIR10_Araport11.gtf

I cannot attach my file due to size constraints, but by downloading the above file and using that command you should be able to get the same reference file to work with. The file that results from the above command does not appear to have duplicates.

In the output files from stringtie on my arabidopsis data, each file seems to have a random duplication of 1 of three genes. Those genes are:

ATCG09900
AT1G79790
AT1G58520

Here is a subset of my list of my output files (whole list is very long), and below each file name is what gene is duplicated (bash uniq command). The name of the file indicates the srx ID from NCBI. Each of these samples had multiple runs associated with it (sra). Before running through the pipeline these “sra” were combined into one large fastq file corresponding to their respective “srx”. You will notice that each file had different combinations of genes that were duplicated, but that they were always 1 of the 3 genes listed above.

As I was thinking about it, I got paranoid that this combining of sra’s into srx may have caused the issue, so I redid it with a sample without combining (used sample SRR4426362). I got the same results. The attached files are a subset of the resulting *.bam that will cause a duplication event when used with this command and the above mentioned reference file:

stringtie -v -p 1 -e -o test1.gtf -G TAIR10_Araport11.gtf -A test1.ga -l catz bundle.bam

I used this command to see the duplicated gene in the gene abundance file:

awk '{print $1}' test1.ga|sort |uniq -d
# AT1G58520

I have also included the output gene abundance file from this subset bam.

I went through all of this again to verify that the results are still happening. Our workflow is using the latest editions of all of the other software as well.

I had to rename the attached files with a ".txt" extension to be able to upload them to git hub, so I hope that the bam still work when you get it.

這一段回答內(nèi)容比較長厘惦，提問者首先解釋了自己對 gtf 文件進(jìn)行了哪些操作以及具體的命令是什么偷仿，同時他又描述了一些自己的做過的嘗試，并且排除了自己認(rèn)為可能的出現(xiàn)問題的原因宵蕉，最后還上傳了開發(fā)者想要的測試文件酝静。以下是開發(fā)者的第二次回復(fù)，已經(jīng)涉及到了關(guān)于這個問題的一些具體解釋羡玛。

Ah, I thought you somehow suggested that transcripts were duplicated (the fix mentioned in the release notes for v1.3.1 was about that), but now I see that your problem is that genes are duplicated in the "gene abundance" output file.

That's actually not quite true; each line in that output file is rather about estimating abundance for "gene" regions (loci), formed by overlapping transcripts, and in some cases such gene regions are not uniquely identified by their ID/name, the coordinates (genomic location) of each such "gene region" (locus) sometimes make a very important distinction.

Look at the transcripts for that gene (AT1G58520) in the annotation file: there are 7 transcripts there, but one of them (AT1G58520.3) is actually not overlapping any of the others, so StringTie treats it as a separate locus ("gene"), and thus it creates a separate entry in the gene abundance file for that particular locus (AT1G58520|Chr1(+)21729913-21731344) which is different from the other, "main" locus which has all the other overlapping transcripts: AT1G58520|Chr1(+)21732566-21738808

StringTie cannot "trust" the reference annotation, as sometimes the gene ID can be just a duplicated string providing no true locus identity.. So you can think of it as StringTie splitting that "gene" into two non-overlapping gene regions and assessing the expression for each gene region independently. Again, the identity of a "gene" (actually "locus") in that file is given by gene ID and the genomic location of the underlying cluster of transcripts which define that locus.

As you can see, that single transcript gene region (locus Chr1(+):21729913-21731344) has zero coverage so I don't know if that gene region is even real (or perhaps just an annotation artifact).

If you really do not care about these situations (where gene definitions are not quite consistent with the transcripts so StringTie separates them like this), you could just add up the coverage values for all these lines with the same gene ID in the gene abundance file and call that the "total" abundance of that "gene" -- but this could be really misleading if the gene ID is not uniquely identifying a locus, i.e. if it's duplicated in other places on the genome (perhaps even on another chromosome).

這一段的作者的回答信息量比較大别智，和我之前寫到的實際類似，stringtie 本身給你的所謂基因定量結(jié)果稼稿，看的并不是基因（名）薄榛，而是基因的位置，如果你的兩坨轉(zhuǎn)錄本沒有交集让歼，就會被自動劃分為兩個位置進(jìn)行定量敞恋。作者特別用擬南芥的注釋文件來舉了個例子，并且說道我的軟件可不會真的相信注釋文是越，因為有的時候基因注釋文件中的id會出現(xiàn)重復(fù)的情況，所以你可以理解為 stringtie 把一些轉(zhuǎn)錄本沒有重疊的基因分成了兩個部分來獨立定量碌上。我們在意的是locus倚评。最后作者也給了一個建議，如果你不在乎馏予，可以把這些基因手動的合并一下天梧，但是如果一個基因竟然出現(xiàn)在基因組的兩個位置，這么做似乎也沒什么意義霞丧。

然后呢岗，提問者并沒有就此停下，他又提到了一個新的疑問，也就是我在上文有提到的后豫，不同的樣本重復(fù)情況不盡相同悉尾。這又是什么原因。

Ok, that makes a lot of sense, I like that stringtie identifies genes this way so that it does not combine only on a gene ID. What I am seeing though is that stringtie will "duplicate" some genes in only some of the output files.

It would make sense if it did it all the time, but I only see it happening in some of them. I guess what I am concerned about is that stringtie is Identifying different amounts of genes each time. So far for arabidopsis I have seen it identify either 37363 or 37364 genes.

From your last response, it seems that it should be identifying scenarios like the AT1G58520 and the AT1G58520.3 every single time it is run, not just occasionally. I was looking back at my data, and it appears that yes, this is true for AT1G58520, but it is not for ATCG09900. At the bottom of this response I have a grep of my gene abundance files for gene ATCG09900, and it appears that in some samples stringtie identifies 2 copies, but sometimes only 1.

I looked at the reference.gtf, and it looks like it should have 2 copies every time like AT1G58520 because even though it has the same gene ID, it has different genomic location. Your response above makes a lot of sense to me, stringtie should require both gene ID and genomic location to avoid bad reference.gtf files. My problem now is that I don't understand why in the case of ATCG09900 it is not being consistent across samples? I would think that the part of stringtie that identifies the gene from the reference.gtf would be independent of the sample.

Here is the reference.gtf for gene ATCG09900, I would expect that stringtie would always identify it as 2 genes from your last response:

ChrC    Araport11   exon    7967    7996    241.00  -   .   transcript_id "ATCG09900.1"; gene_id "ATCG09900"; gene_name "ATCG09900";
ChrC    Araport11   exon    8176    8207    241.00  -   .   transcript_id "ATCG09900.2"; gene_id "ATCG09900"; gene_name "ATCG09900";

Here is a section of a grep of my output gene abundance files that shows that ATCG09900 is sometimes represented by 2 copies, and sometimes only 1. It seems that its representation as 2 copies is independent of if it is actually expressed or not.

SRX2248582/SRX2248582_vs_TAIR10_Araport11.ga
ATCG09900   ATCG09900   ChrC    -   7967    7996    0.000000    0.000000    0.000000
ATCG09900   ATCG09900   ChrC    -   8176    8207    0.0 0.0 0.0

這里提問者再一次詳細(xì)的貼出了自己的實際分析結(jié)果挫酿，也非常清楚的說明白了自己困惑构眯。于是，開發(fā)者又進(jìn)行了進(jìn)一步的回答早龟。

Good question -- apologies for leaving out an important piece of information in my previous explanation: the fact that overlaps with read alignments from the BAM file are also taken into account when determining a "gene region" (locus), so they can act as a "glue", or "bridge" which could put together "gene regions" otherwise separated when looking only at overlaps between reference transcripts. So it is likely that in some samples the two gene regions of ATCG09900 get "bridged" by read alignments overlapping both those regions.. Thus only one gene region is reported for such samples.

It is all related to the concept of "bundle" as used by StringTie. Read alignments and reference transcripts are binned together in a "bundle" defined as a transitive closure of the exon overlap relationship between these objects (read alignments or reference transcripts (guides)). StringTie analyses a "bundle" and unless "weak spots" are found (spurious alignments) to break the bundle into multiple regions, the "bundle" will end up as a "gene region" (locus) reported in the output.

This "clustering by overlap" approach can also have the downside of merging multiple otherwise clearly distinct gene regions together (i.e. with different reference gene IDs), when alignment artifacts and/or transcriptional noise artificially "bridge" over intergenic regions, in some rare cases (especially when the actual genes are very close to each other).

開發(fā)者先感謝對方提出了這樣一個好問題惫霸，他解釋道自己有一個重要的統(tǒng)計細(xì)節(jié)沒有講，就是當(dāng)決定基因范圍時葱弟，stringtie 還會考慮實際的數(shù)據(jù)比對情況壹店，reads 可以作為膠水將兩個分離的基因區(qū)域連接起來，如果有這樣的情況芝加，這個基因也不會被分開硅卢。當(dāng)然，最后作者也說了這樣處理數(shù)據(jù)的一些缺陷妖混。

這次老赤，提問者的困惑被解決了，他對自己的問題進(jìn)行了一個自我總結(jié)作為最后的回復(fù)制市，也作為對開發(fā)者回答的響應(yīng)抬旺。內(nèi)容如下：

Ahh, very good. So the samples where there is only 1 gene region means that there was either a bridge between the two transcripts, or there was no spurious alignment to break apart the bundle (in my case, this often meant no alignment).

With that info, I think I am able to answer my last question on my own. Above, some of the samples has genes that had fpkm of 0, but they still split, while others had fpkm of 0 and did not split. For example:

SRX2248582/SRX2248582_vs_TAIR10_Araport11.ga
ATCG09900   ATCG09900   ChrC    -   7967    7996    0.000000    0.000000    0.000000
ATCG09900   ATCG09900   ChrC    -   8176    8207    0.0 0.0 0.0
SRX2248583/SRX2248583_vs_TAIR10_Araport11.ga
ATCG09900   ATCG09900   ChrC    -   7967    8207    0.0 0.0 0.0

It would appear that SRX2248582 and SRX2248583 should both fail to break the bundle since no alignment has happened in either that would cause the bundle to break. This is the case in SRX2248583 but not in SRX2248582.

I went and looked at the sam file for these, and found that SRX2248582 had a read in that region, while SRX2248583 did not (I also checked the region 100 before, but none overlapped with the region so for brevity I will omit them):

$ grep ChrC *.sam| awk -F "\t" '{ if(($4 >= 7967 && $4 <= 8207)) { print } }'
SRR4426896.15866700 16  ChrC    8000    60  9M1I90M *   0   0   ATATATATATTTTTTTTTCATTTTCTATATTTTTTTCTATATTTTATTATATTATTATATATATATATATTCTTTTTGATTATTTGATTATATAAATATA    CEEDEEEDDDDDDFFFDDDHHGGHHJJJJIGIJIIJIJJJJJIJIHIIGIGGIGJJJJIJJJJJJJJJJJJJJJJJJJJJJJIJJJJHHHHHFFFFFCCC    AS:i:-8 ZS:i:-10    XN:i:0  XM:i:0  XO:i:1  XG:i:1  NM:i:1  MD:Z:99 YT:Z:UU NH:i:1
$ cd ../SRX2248583
$ grep ChrC *.sam| awk -F "\t" '{ if(($4 >= 7967 && $4 <= 8207)) { print } }'

I suppose that this spurious alignment caused the split, but was not counted towards the fpkm for some other reason.

Thank you for all of your help and quick response times, I appreciate it a lot.

在最后的總結(jié)中，作者又提到了一個觀察到的現(xiàn)象：一個基因在不同的樣本中定量的結(jié)果都為 0 祥楣，但是有一些樣本中這個基因被拆分為二开财，另外一些樣本則沒有。他認(rèn)為這可能就是開發(fā)者提到的是否存在一些reads 比對的問題误褪，于是他又檢查了不同樣本中這個基因所在位置區(qū)間的比對情況责鳍，果然和開發(fā)者提到的內(nèi)容相符。

反思

寫到這里兽间，忍不住在 GitHub 上評論了一下历葛，雖然 issue 已經(jīng)關(guān)閉了。

這次 debug 我最先猜測到的還是問題的表象嘀略，然后通過一些后續(xù)的分析找到了可能的問題恤溶。在看到上面這個完整的 issue 后，我深切地感受到了什么是一個優(yōu)秀的提問帜羊，他可以指引被提問對象說出關(guān)鍵的信息咒程，并且循序漸進(jìn)的做出需要的補充內(nèi)容。如果沒有提問者詳細(xì)的描述和認(rèn)真的思考讼育，可能這個問題不知道還要經(jīng)歷多少個來回才能被討論清楚帐姻。

經(jīng)歷了這么一番波折稠集，我對這個工具的喜歡竟然增加了一點，接下來就是再仔細(xì)看看文章和代碼饥瓷。也許剥纷，這就是為什么人總是會對傷害過自己的人和事念念不完。Bioinformatics was like a box of chocolates. You never konw what you're gonna get.

以及扛伍，如何提一個不錯的問題呢筷畦？

掃碼即刻交流