GATK4插件Mutect2參數(shù)解析

作者按

最近在準(zhǔn)備著換一個(gè)職業(yè)賽道,所以在做之前所有項(xiàng)目的回溯,遇到了最最基礎(chǔ)的SNV+Indel的流程,給別人重新講了一遍Mutect2的過濾規(guī)則和參數(shù)選擇奴艾,發(fā)現(xiàn)這個(gè),含金量比我之前寫的SV和CNV高多了脊阴。貼出來給我考試攢人品啦握侧。

Mutect2是基于somatic極大似然模型,通過尋找active region嘿期,迭代尋找可能突變的一款認(rèn)可度很高的軟件。

值得注意的是目前的Mutect2單獨(dú)call變異屬于BETA測試版埋合,只有tumor-normal成對(duì)call是經(jīng)過學(xué)術(shù)認(rèn)可的备徐。

但是目前各大醫(yī)院、臨檢單位甚颂,為了壓縮成本蜜猾、減輕患者壓力,普遍采取大panel像全外顯子這種的采取成對(duì)call振诬,小panel單獨(dú)only tumor蹭睡。通過SNP數(shù)據(jù)集,經(jīng)驗(yàn)值或者其他一代二代驗(yàn)證方法等等過濾產(chǎn)生的假陽性赶么。

那么過濾就需要細(xì)之又細(xì)肩豁,以下是Mutect2的所有參數(shù)。參考來源于Mutect官方文檔之Mathematical notes on Mutect的chapter8。

概述

Mutect2一共有14個(gè)過濾標(biāo)簽(vcf的filter列可能出現(xiàn)的tag)清钥,每個(gè)標(biāo)簽對(duì)應(yīng)一個(gè)或者好幾個(gè)值琼锋。vcf里每個(gè)點(diǎn)都有這14個(gè)值,值的意義在vcf的info列列出祟昭,見下表的key缕坎。這每個(gè)key或者filter對(duì)應(yīng)一個(gè)在運(yùn)行Mutect時(shí)候設(shè)置的參數(shù),見下表的Argument篡悟。

Mutect參數(shù)對(duì)應(yīng)的過濾標(biāo)簽和INFO列的對(duì)應(yīng)鍵值對(duì)

舉個(gè)例子:

見下圖的某位點(diǎn)谜叹,base_quality標(biāo)簽出現(xiàn)在了filter列,它在call的時(shí)候估算的參數(shù)為MBQ=0搬葬,而我們?cè)O(shè)置的min-median-base-quality參數(shù)為20叉谜,因?yàn)?<20,所以base_quality的標(biāo)簽出現(xiàn)在了tag里邊。其余參數(shù)以此類推踩萎。

pos example

像這種與likelihood model相關(guān)性弱一點(diǎn)的參數(shù)很好理解停局。萬一是t-lod標(biāo)簽?zāi)兀克硎裁匆馑寄叵愀坑质侨绾嗡愠鰜淼哪囟裕窟@就涉及到它的數(shù)學(xué)模型了,我已經(jīng)努力地在寫一篇比較淺顯易懂的介紹了企孩,但是因?yàn)镺neNote寫公式實(shí)在太累锭碳,所以我決定手推,等下次更新的時(shí)候就可以看到一篇全篇公式的手寫圖片式科普23333

我一定是準(zhǔn)備GMAT的數(shù)學(xué)太簡單了才如此找虐的勿璃。擒抛。。

好了补疑,言歸正傳歧沪,接下來依次介紹各個(gè)標(biāo)簽的含義:

參數(shù)含義

1. t_lod

  • tumor-lod is the minimum likelihood of an allele as determined by the somatic likelihoods model required to pass.
  • LOD threshold for calling tumor variant.
  • 似然模型中認(rèn)為該點(diǎn)是體細(xì)胞變異的最小似然比,默認(rèn)是5.3,若小于5.3則添加tlod標(biāo)簽莲组。
  • Default value: 5.3
  • Example


    t-lod example

2.clustered_events

  • max-events-in-region is the maximum allowable number of called variants co-occurring in a single assembly region. If the number of called variants exceeds this they will all be filtered.

  • Variants coming from an assembly region with more than this many events are filtered.

  • 活動(dòng)區(qū)域發(fā)生多次突變诊胞,且突變位點(diǎn)距離在3bp及以上∏妈荆【為什么大于3bp呢撵孤?因?yàn)?個(gè)的話很有可能是可以合并的同一個(gè)〗咄】

  • Default value: 2.

  • 因?yàn)檫@個(gè)曾經(jīng)困擾了我很久邪码,所以直接在GitHub扒了gatk源碼。查看的時(shí)候關(guān)于這個(gè)參數(shù)的部分長這樣


  • 然后又詢問官方咬清,得到的回復(fù)是建議認(rèn)為是假性闭专。


  • example



  • note
    如果是大panel奴潘,大部分位點(diǎn)都會(huì)帶這個(gè)標(biāo)簽,根據(jù)經(jīng)驗(yàn)喻圃,如果過濾萤彩,會(huì)過濾掉真陽性位點(diǎn),我個(gè)人建議保留斧拍,或者至少驗(yàn)證一下雀扶。【真的不能粗暴過濾肆汹,我驗(yàn)證過的】

    以下是我當(dāng)時(shí)調(diào)整參數(shù)--max-enents-in-region長度愚墓,得到clustered_events標(biāo)簽的個(gè)數(shù)。發(fā)現(xiàn)十幾bp的時(shí)候昂勉,針對(duì)我的panel浪册,帶標(biāo)簽的點(diǎn)并不會(huì)減少多少。所以用官方的

3.duplicated_evidence

  • unique-alt-read-count is the minimum number of unique (start position, fragment length) pairs required to make a call. This count is a proxy for the number of unique molecules (as opposed to PCR duplicates) supporting an allele.
  • 設(shè)置單獨(dú)支持該等位基因的最小分子數(shù)岗照,默認(rèn)值是0
  • Filter a variant if a site contains fewer than this many unique (i.e. deduplicated) reads supporting the alternate allele Default value: 0.
  • Example


4.multiallelic

  • max-alt-allele-count is the maximum allowable number of alt alleles at a site. By default only biallelic variants pass the filter.
  • 某一個(gè)點(diǎn)等位基因的最大數(shù)村象,默認(rèn)情況下,只有雙等位基因變異通過過濾器攒至。
  • Default value: 1.
  • filter variants with too many alt alleles
  • Example


5.germline_risk

  • max-germline-posterior is the maximum posterior probability, as determined by the above germline probability model, that a variant is a germline event.
  • 該位點(diǎn)是germline event的最大后驗(yàn)概率厚者, 根據(jù)模型計(jì)算P_GERMLINE值,大于設(shè)定值就會(huì)添加germline_risk的標(biāo)簽迫吐。
  • Maximum posterior probability that an allele is a germline variant.
  • Default value: 0.1.
  • Example


6.artifact_in_normal

  • normal-artifact-lod is the maximum acceptable likelihood of an allele in the normal by the somatic likelihoods model. This is different from the normal likelihood that goes into the germline model, which makes a diploid assumption. Here we compute the normal likelihood as if it were a tumor in order to detect artifacts.
  • 當(dāng)tumor和control成對(duì)call的時(shí)候库菲,會(huì)對(duì)control組的normal樣本單獨(dú)設(shè)置對(duì)數(shù)比閾值,該閾值越高志膀,過濾標(biāo)準(zhǔn)越嚴(yán)格熙宇,因?yàn)檎J(rèn)為normal全部是假陽性,所以會(huì)設(shè)置較低的LOD值溉浙。
  • LOD threshold for calling normal artifacts
  • Default value: 0.0.
  • example
    我沒有tumo-normal對(duì)烫止,所以無圖可用,可憐兮兮~~~

7.strand_artifact

  • max-strand-artifact-probability is the posterior probability of a strand artifact, as determined by the model described above, required to apply the strand artifact filter.
  • 鏈偏好性的后驗(yàn)概率放航,根據(jù)計(jì)算的SA_POST_PROB烈拒,大于設(shè)定值則過濾;還有第二層補(bǔ)充條件广鳍;
  • This is necessary but not sufficient – we also require the estimated max a posteriori allele fraction to be less than min-strand-artifact-allele-fraction.The second condition prevents filtering real variants that also have significant strand bias, i.e. a true variant that also has some artifactual reads.
  • 如果鏈偏好性的最大后驗(yàn)概率比SA_MAP_AF(MAP estimates of allele fraction given
  • 變異頻率的最大后驗(yàn)概率)值小,會(huì)保留吓妆,以防將真陽性位點(diǎn)加上strand_artifact標(biāo)簽赊时。
  • Filter a variant if the probability of strand artifact exceeds this number
  • Default value: 0.99.
  • Example


8.base_quality

  • min-median-base-quality is the minimum median base quality of bases supporting a SNV.
  • 支持SNV的最小堿基質(zhì)量。( median base quality of bases 堿基質(zhì)量衡量標(biāo)準(zhǔn)行拢,在fastqc軟件中該值為25)
  • filter variants for which alt reads' median base quality is too low.
  • Default value: 20.
  • Example


9.mapping_quality

  • min-median-mapping-quality is the minimum median mapping quality of reads supporting an allele.
  • 支持SNV的最小mapping質(zhì)量祖秒。
  • Filter variants for which alt reads' median mapping quality is too low.
  • Default value:30.
  • Example


10.fragment_length

  • max-median-fragment-length-difference is the maximum difference between the median fragment lengths reads supporting alt and reference alleles. Note that fragment length is based on where paired reads are mapped,not the actual physical fragment length.
  • 是支持alt和ref片段長度之間的最大差異。片段長度是基于mapping成對(duì)讀取時(shí)的長度,而不是實(shí)際物理片段長度竭缝。過濾掉ref和alt比對(duì)片段差異巨大的點(diǎn)房维。
  • Filter variants for which alt reads' median fragment length is very different from the median for ref reads.
  • Default value: 10000.
  • example
    這個(gè)也是tumor-normal成對(duì)call才會(huì)出現(xiàn)的,我沒有例子可展示抬纸,可憐兮兮X2

11.read_position

  • min-median-read-position is the minimum median length of bases supporting an allele from the closest end of the read. Indels positions are measured by the end farthest from the end of the read.
  • 位點(diǎn)到read末尾的最近讀取端的最小中值長度咙俩。DENELS的位置是由讀數(shù)末尾最遠(yuǎn)的一端測量的。
  • filter variants for which the median position of alt alleles within reads is too near the end of reads.
  • Default value: 5.
  • Example


12.panel_of_normals

  • One of the two unadjustable filters. the panel of normals filter removes all alleles at a site belonging to the panel of normals, which is a vcf of blacklisted artifact sites. It can be disabled by not passing a panel of normals to Mutect2.
  • 該位點(diǎn)也在PON中存在
  • Example


13.contamination

  • If FilterMutectCalls is passed a contamination-table from CalculateContamination it will filter alleles with allele fraction less than the whole-bam contamination in the table.
  • 過濾樣品污染
  • --contamination-table:File
  • example:
    單個(gè)樣品就不需要做啦

14.str_contraction

  • One of the two unadjustable filters, an STR contraction filter which removes variants that are the deletion of a single repeat unit of an STR when this repeat unit contains more than one base.

  • 當(dāng)該重復(fù)大于一個(gè)base湿故,過濾短串聯(lián)重復(fù)區(qū)阿趁,STR( Variant is a short tandem repeat )。

  • RPA=Number of times tandem repeat unit is repeated, for each allele (including reference)

  • RU=Tandem repeat unit (bases)

  • Example


寫在最后

很多參數(shù)坛猪,比如t-lod是模型最終的檢驗(yàn)T值脖阵,artifact_in_normal是normal的后驗(yàn)概率相關(guān),如果不了解模型墅茉,可能無法理解其假陽性的中間推導(dǎo)過程命黔。建議各位同學(xué)仔細(xì)讀一讀Mathematical Notes on Mutect(David Benjamin? and Takuto Sato?Broad Institute, 75 Ames Street, Cambridge, MA 02142
(Dated: September 26, 2018)。

?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請(qǐng)聯(lián)系作者
  • 序言:七十年代末就斤,一起剝皮案震驚了整個(gè)濱河市悍募,隨后出現(xiàn)的幾起案子,更是在濱河造成了極大的恐慌战转,老刑警劉巖搜立,帶你破解...
    沈念sama閱讀 212,816評(píng)論 6 492
  • 序言:濱河連續(xù)發(fā)生了三起死亡事件,死亡現(xiàn)場離奇詭異槐秧,居然都是意外死亡啄踊,警方通過查閱死者的電腦和手機(jī),發(fā)現(xiàn)死者居然都...
    沈念sama閱讀 90,729評(píng)論 3 385
  • 文/潘曉璐 我一進(jìn)店門刁标,熙熙樓的掌柜王于貴愁眉苦臉地迎上來颠通,“玉大人,你說我怎么就攤上這事膀懈《倜蹋” “怎么了?”我有些...
    開封第一講書人閱讀 158,300評(píng)論 0 348
  • 文/不壞的土叔 我叫張陵启搂,是天一觀的道長硼控。 經(jīng)常有香客問我,道長胳赌,這世上最難降的妖魔是什么牢撼? 我笑而不...
    開封第一講書人閱讀 56,780評(píng)論 1 285
  • 正文 為了忘掉前任,我火速辦了婚禮疑苫,結(jié)果婚禮上熏版,老公的妹妹穿的比我還像新娘纷责。我一直安慰自己,他們只是感情好撼短,可當(dāng)我...
    茶點(diǎn)故事閱讀 65,890評(píng)論 6 385
  • 文/花漫 我一把揭開白布再膳。 她就那樣靜靜地躺著,像睡著了一般曲横。 火紅的嫁衣襯著肌膚如雪喂柒。 梳的紋絲不亂的頭發(fā)上,一...
    開封第一講書人閱讀 50,084評(píng)論 1 291
  • 那天胜榔,我揣著相機(jī)與錄音胳喷,去河邊找鬼。 笑死夭织,一個(gè)胖子當(dāng)著我的面吹牛吭露,可吹牛的內(nèi)容都是我干的。 我是一名探鬼主播尊惰,決...
    沈念sama閱讀 39,151評(píng)論 3 410
  • 文/蒼蘭香墨 我猛地睜開眼讲竿,長吁一口氣:“原來是場噩夢(mèng)啊……” “哼!你這毒婦竟也來了弄屡?” 一聲冷哼從身側(cè)響起题禀,我...
    開封第一講書人閱讀 37,912評(píng)論 0 268
  • 序言:老撾萬榮一對(duì)情侶失蹤,失蹤者是張志新(化名)和其女友劉穎膀捷,沒想到半個(gè)月后迈嘹,有當(dāng)?shù)厝嗽跇淞掷锇l(fā)現(xiàn)了一具尸體,經(jīng)...
    沈念sama閱讀 44,355評(píng)論 1 303
  • 正文 獨(dú)居荒郊野嶺守林人離奇死亡全庸,尸身上長有42處帶血的膿包…… 初始之章·張勛 以下內(nèi)容為張勛視角 年9月15日...
    茶點(diǎn)故事閱讀 36,666評(píng)論 2 327
  • 正文 我和宋清朗相戀三年秀仲,在試婚紗的時(shí)候發(fā)現(xiàn)自己被綠了。 大學(xué)時(shí)的朋友給我發(fā)了我未婚夫和他白月光在一起吃飯的照片壶笼。...
    茶點(diǎn)故事閱讀 38,809評(píng)論 1 341
  • 序言:一個(gè)原本活蹦亂跳的男人離奇死亡神僵,死狀恐怖,靈堂內(nèi)的尸體忽然破棺而出覆劈,到底是詐尸還是另有隱情保礼,我是刑警寧澤,帶...
    沈念sama閱讀 34,504評(píng)論 4 334
  • 正文 年R本政府宣布责语,位于F島的核電站炮障,受9級(jí)特大地震影響,放射性物質(zhì)發(fā)生泄漏坤候。R本人自食惡果不足惜铝阐,卻給世界環(huán)境...
    茶點(diǎn)故事閱讀 40,150評(píng)論 3 317
  • 文/蒙蒙 一、第九天 我趴在偏房一處隱蔽的房頂上張望铐拐。 院中可真熱鬧徘键,春花似錦、人聲如沸遍蟋。這莊子的主人今日做“春日...
    開封第一講書人閱讀 30,882評(píng)論 0 21
  • 文/蒼蘭香墨 我抬頭看了看天上的太陽虚青。三九已至它呀,卻和暖如春,著一層夾襖步出監(jiān)牢的瞬間棒厘,已是汗流浹背纵穿。 一陣腳步聲響...
    開封第一講書人閱讀 32,121評(píng)論 1 267
  • 我被黑心中介騙來泰國打工, 沒想到剛下飛機(jī)就差點(diǎn)兒被人妖公主榨干…… 1. 我叫王不留奢人,地道東北人谓媒。 一個(gè)月前我還...
    沈念sama閱讀 46,628評(píng)論 2 362
  • 正文 我出身青樓,卻偏偏與公主長得像何乎,于是被迫代替她去往敵國和親句惯。 傳聞我的和親對(duì)象是個(gè)殘疾皇子,可洞房花燭夜當(dāng)晚...
    茶點(diǎn)故事閱讀 43,724評(píng)論 2 351