作者按
最近在準(zhǔn)備著換一個(gè)職業(yè)賽道,所以在做之前所有項(xiàng)目的回溯,遇到了最最基礎(chǔ)的SNV+Indel的流程,給別人重新講了一遍Mutect2的過濾規(guī)則和參數(shù)選擇奴艾,發(fā)現(xiàn)這個(gè),含金量比我之前寫的SV和CNV高多了脊阴。貼出來給我考試攢人品啦握侧。
Mutect2是基于somatic極大似然模型,通過尋找active region嘿期,迭代尋找可能突變的一款認(rèn)可度很高的軟件。
值得注意的是目前的Mutect2單獨(dú)call變異屬于BETA測試版埋合,只有tumor-normal成對(duì)call是經(jīng)過學(xué)術(shù)認(rèn)可的备徐。
但是目前各大醫(yī)院、臨檢單位甚颂,為了壓縮成本蜜猾、減輕患者壓力,普遍采取大panel像全外顯子這種的采取成對(duì)call振诬,小panel單獨(dú)only tumor蹭睡。通過SNP數(shù)據(jù)集,經(jīng)驗(yàn)值或者其他一代二代驗(yàn)證方法等等過濾產(chǎn)生的假陽性赶么。
那么過濾就需要細(xì)之又細(xì)肩豁,以下是Mutect2的所有參數(shù)。參考來源于Mutect官方文檔之Mathematical notes on Mutect的chapter8。
概述
Mutect2一共有14個(gè)過濾標(biāo)簽(vcf的filter列可能出現(xiàn)的tag)清钥,每個(gè)標(biāo)簽對(duì)應(yīng)一個(gè)或者好幾個(gè)值琼锋。vcf里每個(gè)點(diǎn)都有這14個(gè)值,值的意義在vcf的info列列出祟昭,見下表的key缕坎。這每個(gè)key或者filter對(duì)應(yīng)一個(gè)在運(yùn)行Mutect時(shí)候設(shè)置的參數(shù),見下表的Argument篡悟。
舉個(gè)例子:
見下圖的某位點(diǎn)谜叹,base_quality
標(biāo)簽出現(xiàn)在了filter列,它在call的時(shí)候估算的參數(shù)為MBQ=0搬葬,而我們?cè)O(shè)置的min-median-base-quality
參數(shù)為20叉谜,因?yàn)?<20,所以base_quality
的標(biāo)簽出現(xiàn)在了tag里邊。其余參數(shù)以此類推踩萎。
像這種與likelihood model相關(guān)性弱一點(diǎn)的參數(shù)很好理解停局。萬一是t-lod
標(biāo)簽?zāi)兀克硎裁匆馑寄叵愀坑质侨绾嗡愠鰜淼哪囟裕窟@就涉及到它的數(shù)學(xué)模型了,我已經(jīng)努力地在寫一篇比較淺顯易懂的介紹了企孩,但是因?yàn)镺neNote寫公式實(shí)在太累锭碳,所以我決定手推,等下次更新的時(shí)候就可以看到一篇全篇公式的手寫圖片式科普23333
我一定是準(zhǔn)備GMAT的數(shù)學(xué)太簡單了才如此找虐的勿璃。擒抛。。
好了补疑,言歸正傳歧沪,接下來依次介紹各個(gè)標(biāo)簽的含義:
參數(shù)含義
1. t_lod
- tumor-lod is the minimum likelihood of an allele as determined by the somatic likelihoods model required to pass.
- LOD threshold for calling tumor variant.
- 似然模型中認(rèn)為該點(diǎn)是體細(xì)胞變異的最小似然比,默認(rèn)是5.3,若小于5.3則添加tlod標(biāo)簽莲组。
- Default value: 5.3
-
Example
2.clustered_events
max-events-in-region is the maximum allowable number of called variants co-occurring in a single assembly region. If the number of called variants exceeds this they will all be filtered.
Variants coming from an assembly region with more than this many events are filtered.
活動(dòng)區(qū)域發(fā)生多次突變诊胞,且突變位點(diǎn)距離在3bp及以上∏妈荆【為什么大于3bp呢撵孤?因?yàn)?個(gè)的話很有可能是可以合并的同一個(gè)〗咄】
Default value: 2.
-
因?yàn)檫@個(gè)曾經(jīng)困擾了我很久邪码,所以直接在GitHub扒了gatk源碼。查看的時(shí)候關(guān)于這個(gè)參數(shù)的部分長這樣
-
然后又詢問官方咬清,得到的回復(fù)是建議認(rèn)為是假性闭专。
-
example
-
note
如果是大panel奴潘,大部分位點(diǎn)都會(huì)帶這個(gè)標(biāo)簽,根據(jù)經(jīng)驗(yàn)喻圃,如果過濾萤彩,會(huì)過濾掉真陽性位點(diǎn),我個(gè)人建議保留斧拍,或者至少驗(yàn)證一下雀扶。【真的不能粗暴過濾肆汹,我驗(yàn)證過的】以下是我當(dāng)時(shí)調(diào)整參數(shù)
--max-enents-in-region
長度愚墓,得到clustered_events
標(biāo)簽的個(gè)數(shù)。發(fā)現(xiàn)十幾bp的時(shí)候昂勉,針對(duì)我的panel浪册,帶標(biāo)簽的點(diǎn)并不會(huì)減少多少。所以用官方的
3.duplicated_evidence
- unique-alt-read-count is the minimum number of unique (start position, fragment length) pairs required to make a call. This count is a proxy for the number of unique molecules (as opposed to PCR duplicates) supporting an allele.
- 設(shè)置單獨(dú)支持該等位基因的最小分子數(shù)岗照,默認(rèn)值是0
- Filter a variant if a site contains fewer than this many unique (i.e. deduplicated) reads supporting the alternate allele Default value: 0.
-
Example
4.multiallelic
- max-alt-allele-count is the maximum allowable number of alt alleles at a site. By default only biallelic variants pass the filter.
- 某一個(gè)點(diǎn)等位基因的最大數(shù)村象,默認(rèn)情況下,只有雙等位基因變異通過過濾器攒至。
- Default value: 1.
- filter variants with too many alt alleles
-
Example
5.germline_risk
- max-germline-posterior is the maximum posterior probability, as determined by the above germline probability model, that a variant is a germline event.
- 該位點(diǎn)是germline event的最大后驗(yàn)概率厚者, 根據(jù)模型計(jì)算P_GERMLINE值,大于設(shè)定值就會(huì)添加germline_risk的標(biāo)簽迫吐。
- Maximum posterior probability that an allele is a germline variant.
- Default value: 0.1.
-
Example
6.artifact_in_normal
- normal-artifact-lod is the maximum acceptable likelihood of an allele in the normal by the somatic likelihoods model. This is different from the normal likelihood that goes into the germline model, which makes a diploid assumption. Here we compute the normal likelihood as if it were a tumor in order to detect artifacts.
- 當(dāng)tumor和control成對(duì)call的時(shí)候库菲,會(huì)對(duì)control組的normal樣本單獨(dú)設(shè)置對(duì)數(shù)比閾值,該閾值越高志膀,過濾標(biāo)準(zhǔn)越嚴(yán)格熙宇,因?yàn)檎J(rèn)為normal全部是假陽性,所以會(huì)設(shè)置較低的LOD值溉浙。
- LOD threshold for calling normal artifacts
- Default value: 0.0.
- example
我沒有tumo-normal對(duì)烫止,所以無圖可用,可憐兮兮~~~
7.strand_artifact
- max-strand-artifact-probability is the posterior probability of a strand artifact, as determined by the model described above, required to apply the strand artifact filter.
- 鏈偏好性的后驗(yàn)概率放航,根據(jù)計(jì)算的SA_POST_PROB烈拒,大于設(shè)定值則過濾;還有第二層補(bǔ)充條件广鳍;
- This is necessary but not sufficient – we also require the estimated max a posteriori allele fraction to be less than min-strand-artifact-allele-fraction.The second condition prevents filtering real variants that also have significant strand bias, i.e. a true variant that also has some artifactual reads.
- 如果鏈偏好性的最大后驗(yàn)概率比SA_MAP_AF(MAP estimates of allele fraction given
- 變異頻率的最大后驗(yàn)概率)值小,會(huì)保留吓妆,以防將真陽性位點(diǎn)加上strand_artifact標(biāo)簽赊时。
- Filter a variant if the probability of strand artifact exceeds this number
- Default value: 0.99.
-
Example
8.base_quality
- min-median-base-quality is the minimum median base quality of bases supporting a SNV.
- 支持SNV的最小堿基質(zhì)量。( median base quality of bases 堿基質(zhì)量衡量標(biāo)準(zhǔn)行拢,在fastqc軟件中該值為25)
- filter variants for which alt reads' median base quality is too low.
- Default value: 20.
-
Example
9.mapping_quality
- min-median-mapping-quality is the minimum median mapping quality of reads supporting an allele.
- 支持SNV的最小mapping質(zhì)量祖秒。
- Filter variants for which alt reads' median mapping quality is too low.
- Default value:30.
-
Example
10.fragment_length
- max-median-fragment-length-difference is the maximum difference between the median fragment lengths reads supporting alt and reference alleles. Note that fragment length is based on where paired reads are mapped,not the actual physical fragment length.
- 是支持alt和ref片段長度之間的最大差異。片段長度是基于mapping成對(duì)讀取時(shí)的長度,而不是實(shí)際物理片段長度竭缝。過濾掉ref和alt比對(duì)片段差異巨大的點(diǎn)房维。
- Filter variants for which alt reads' median fragment length is very different from the median for ref reads.
- Default value: 10000.
- example
這個(gè)也是tumor-normal成對(duì)call才會(huì)出現(xiàn)的,我沒有例子可展示抬纸,可憐兮兮X2
11.read_position
- min-median-read-position is the minimum median length of bases supporting an allele from the closest end of the read. Indels positions are measured by the end farthest from the end of the read.
- 位點(diǎn)到read末尾的最近讀取端的最小中值長度咙俩。DENELS的位置是由讀數(shù)末尾最遠(yuǎn)的一端測量的。
- filter variants for which the median position of alt alleles within reads is too near the end of reads.
- Default value: 5.
-
Example
12.panel_of_normals
- One of the two unadjustable filters. the panel of normals filter removes all alleles at a site belonging to the panel of normals, which is a vcf of blacklisted artifact sites. It can be disabled by not passing a panel of normals to Mutect2.
- 該位點(diǎn)也在PON中存在
-
Example
13.contamination
- If FilterMutectCalls is passed a contamination-table from CalculateContamination it will filter alleles with allele fraction less than the whole-bam contamination in the table.
- 過濾樣品污染
- --contamination-table:File
- example:
單個(gè)樣品就不需要做啦
14.str_contraction
One of the two unadjustable filters, an STR contraction filter which removes variants that are the deletion of a single repeat unit of an STR when this repeat unit contains more than one base.
當(dāng)該重復(fù)大于一個(gè)base湿故,過濾短串聯(lián)重復(fù)區(qū)阿趁,STR( Variant is a short tandem repeat )。
RPA=Number of times tandem repeat unit is repeated, for each allele (including reference)
RU=Tandem repeat unit (bases)
-
Example
寫在最后
很多參數(shù)坛猪,比如t-lod
是模型最終的檢驗(yàn)T值脖阵,artifact_in_normal
是normal的后驗(yàn)概率相關(guān),如果不了解模型墅茉,可能無法理解其假陽性的中間推導(dǎo)過程命黔。建議各位同學(xué)仔細(xì)讀一讀Mathematical Notes on Mutect(David Benjamin? and Takuto Sato?Broad Institute, 75 Ames Street, Cambridge, MA 02142
(Dated: September 26, 2018)。