Part1: Downloading the gnomAD LOEUF data
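The previous post walked through the gnomAD flagship paper. That paper introduces LOEUF, a new model for measuring how tolerant a gene is to LoF variants, and you are probably keen to start using it.
The results of this model can be found in the paper's supplementary materials (supplementary_dataset_11_full_constraint_metrics.tsv.gz). Alternatively (you may need a VPN to reach the site), go to the gnomAD website → Downloads in the upper right → Constraint under the gnomad.v2.1.1 release, and follow the links to download. The website provides: ① a gene list table containing only the canonical transcript of each gene; ② a transcript list table covering multiple transcripts per gene (same content as the paper's supplementary table); ③ the expected/observed values after downsampling by population sample size.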
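Once the file is downloaded, a first sanity check is to load it and count the canonical rows. The following is a minimal sketch using only the Python standard library. The helper names (`read_constraint_table`, `open_maybe_gzip`) and the demo rows are made up for illustration, and the sketch assumes the `canonical` column is spelled true/false in some letter case.

```python
import csv
import gzip
import io


def read_constraint_table(handle):
    """Parse a tab-separated constraint table into a list of dict rows."""
    return list(csv.DictReader(handle, delimiter="\t"))


def open_maybe_gzip(path):
    """Open a .gz or plain-text file in text mode (hypothetical helper)."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt", newline="")
    return open(path, "rt", newline="")


# Tiny in-memory stand-in for the real file; all values are invented.
demo = io.StringIO(
    "gene\ttranscript\tcanonical\toe_lof_upper\n"
    "GENE_A\tENST_A1\ttrue\t0.25\n"
    "GENE_A\tENST_A2\tfalse\t0.60\n"
    "GENE_B\tENST_B1\ttrue\t1.10\n"
)
rows = read_constraint_table(demo)

# Assumption: canonical is written as true/false (any letter case).
canonical_rows = [r for r in rows if r["canonical"].lower() == "true"]
print(len(rows), len(canonical_rows))
```

For the real file, replace the in-memory demo with `with open_maybe_gzip(path) as handle: rows = read_constraint_table(handle)`.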
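Part2: What the table columns mean
Take full_constraint_metrics, i.e. lof_metrics.by_transcript, as the example.
The file has 80,950 rows, covering the canonical transcript and other common transcripts of 19,600+ genes, and 78 columns. Each column header is explained below (see page 74 of the supplementary information document for details):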
gene: Gene name
transcript: Ensembl transcript ID (Gencode v19)
canonical: Boolean indicator as to whether the transcript is the canonical transcript for the gene
obs_XXX: Number of observed XXX variants in transcript (XXX = mis for missense, syn for synonymous, lof for loss of function)
exp_XXX: Number of expected XXX variants in transcript
oe_XXX: Observed over expected ratio for XXX variants in transcript (obs_XXX divided by exp_XXX)
mu_XXX: Mutation rate summed across all possible XXX variants in transcript
possible_XXX: Number of possible XXX variants in transcript (I do not fully understand this one)
obs_XXX_pphen: Number of observed XXX variants in transcript predicted "probably damaging" by PolyPhen-2 (PolyPhen-2 scores missense variants, so XXX = mis for the four pphen columns)
exp_XXX_pphen: Number of expected XXX variants in transcript predicted "probably damaging" by PolyPhen-2
oe_XXX_pphen: Observed over expected ratio for PolyPhen-2 predicted "probably damaging" XXX variants in transcript (obs_XXX_pphen divided by exp_XXX_pphen)
possible_XXX_pphen: Number of possible XXX variants in transcript that are predicted "probably damaging" by PolyPhen-2 (I do not fully understand this one either)
oe_XXX_lower: Lower bound of 90% confidence interval for o/e ratio for XXX variants
oe_XXX_upper: Upper bound of 90% confidence interval for o/e ratio for XXX variants
XXX_z: Z score for XXX variants in gene. Higher (more positive) Z scores indicate that the transcript is more intolerant of variation (more constrained). Extreme values of XXX_z indicate likely data quality issues.
pLI: Probability of loss-of-function intolerance; probability that transcript falls into distribution of haploinsufficient genes (~9% o/e pLoF ratio; computed from gnomAD data)
pRec: Probability that transcript falls into distribution of recessive genes (~46% o/e pLoF ratio; computed from gnomAD data)
pNull: Probability that transcript falls into distribution of unconstrained genes (~100% o/e pLoF ratio; computed from gnomAD data)
oe_lof_upper_rank: Transcript's rank of LOEUF value compared to all transcripts (lower values indicate more constrained)
oe_lof_upper_bin: Decile bin of LOEUF for given transcript (lower values indicate more constrained)
(The two columns above are the main indicators of where a transcript sits by LOEUF: its rank and its decile bin.)
oe_lof_upper_bin_6: Sextile bin of LOEUF for given transcript (lower values indicate more constrained)
n_sites: Number of distinct pLoF variant sites in the transcript
classic_caf: Sum of allele frequencies of pLoFs in the transcript
max_af: Maximum allele frequency of any pLoF in the transcript
no_lofs: The number of individuals with no observed pLoF variants in the transcript
obs_het_lof: The number of individuals with at least one observed heterozygous pLoF variant, but no homozygous pLoF variants, in the transcript
obs_hom_lof: The number of individuals with at least one observed homozygous pLoF in the transcript
defined: The number of individuals where at least one high-quality genotype (including homozygous reference) is observed at a called site annotated as a pLoF variant
p: The estimated proportion of haplotypes with a pLoF variant. Defined as: 1 - sqrt(no_lofs / defined)
exp_hom_lof: The expected number of individuals with at least one homozygous pLoF variant based on the frequency of pLoF haplotypes. Defined as: defined * p^2
classic_caf_POP: Sum of allele frequencies of pLoFs in the transcript among POP individuals
p_POP: The computation of `p` repeated among only POP individuals
transcript_type: Transcript biotype (https://www.gencodegenes.org/pages/biotypes.html)
gene_id: Ensembl gene ID
transcript_level: Transcript level from Gencode (https://www.gencodegenes.org/pages/data_format.html)
cds_length: Length of coding sequence in gene
num_coding_exons: Number of coding exons in gene
gene_type: Gene biotype (https://www.gencodegenes.org/pages/biotypes.html)
gene_length: Length of gene
exac_pLI: pLI score calculated from ExAC
exac_obs_lof: Number of observed pLoF variants in gene in ExAC
exac_exp_lof: Number of expected pLoF variants in gene in ExAC
exac_oe_lof: Observed to expected ratio of pLoF variants in ExAC
brain_expression: Expression of gene in brain from GTEx data
chromosome: Chromosome name
start_position: Start position of gene
end_position: End position of gene
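Three of the columns above are defined by simple formulas: oe_XXX = obs_XXX / exp_XXX, p = 1 - sqrt(no_lofs / defined), and exp_hom_lof = defined * p^2. The sketch below works through them with invented numbers. The function names are mine, not gnomAD's. The form of `p` is consistent with assuming the two haplotypes of an individual are independent, so that the fraction of individuals carrying no pLoF is (1 - p)^2.

```python
import math


def oe_ratio(obs, exp):
    """Observed over expected ratio, e.g. oe_lof = obs_lof / exp_lof."""
    return obs / exp


def plof_haplotype_freq(no_lofs, defined):
    """p = 1 - sqrt(no_lofs / defined)."""
    return 1.0 - math.sqrt(no_lofs / defined)


def expected_hom_lof(no_lofs, defined):
    """exp_hom_lof = defined * p^2."""
    p = plof_haplotype_freq(no_lofs, defined)
    return defined * p ** 2


# Invented example: 5 observed vs 20 expected pLoF variants,
# and 9,801 of 10,000 well-genotyped individuals carry no pLoF.
print(f"oe_lof      = {oe_ratio(5, 20):.2f}")
print(f"p           = {plof_haplotype_freq(9801, 10000):.4f}")
print(f"exp_hom_lof = {expected_hom_lof(9801, 10000):.4f}")
```

With these made-up inputs, sqrt(0.9801) = 0.99, so p is 0.01 and about one homozygous individual is expected among 10,000.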
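Part3: Annotating with ANNOVAR
Building the annotation database file:
Based on my reading of the headers, I used "gnomad.v2.1.1.lof_metrics.by_gene.txt", which by default contains only the canonical=TRUE transcripts, as the data source.
For convenience I moved the last 3 columns (chromosome, start_position, end_position) to the front, then picked the columns that I think matter most for my own disease analyses:
gene, gene_id, transcript, oe_lof_upper_bin, oe_lof_upper_rank, pLI, pRec, pNull, transcript_level
This gives a 12-column file, which I named "hg19_LOEUF.txt". With a file this small, building an ANNOVAR index for it is optional. If you want one, the indexing script index_annovar.pl can be obtained by emailing Prof. Kai Wang.
Note that an ANNOVAR region-based annotation database must not have a header line, and all annotation values end up merged into a single output column that you have to split yourself afterwards, so keep a separate record of the column order above.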
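The column shuffling described above can be sketched in Python as follows. This is an illustration under assumptions, not the exact procedure used for this post: `write_region_db` is a hypothetical helper, the demo row is invented, and no claim is made about whether chromosome names need a "chr" prefix for your ANNOVAR setup. The `OUTPUT_COLUMNS` list doubles as the separate record of column order, since the file itself carries no header.

```python
import csv
import io

# The three position columns go first, then the nine selected columns.
POSITION_COLUMNS = ["chromosome", "start_position", "end_position"]
SELECTED_COLUMNS = [
    "gene", "gene_id", "transcript", "oe_lof_upper_bin",
    "oe_lof_upper_rank", "pLI", "pRec", "pNull", "transcript_level",
]
OUTPUT_COLUMNS = POSITION_COLUMNS + SELECTED_COLUMNS


def write_region_db(rows, out_handle):
    """Write rows as header-less, tab-separated lines; return the row count."""
    writer = csv.writer(out_handle, delimiter="\t", lineterminator="\n")
    written = 0
    for row in rows:
        writer.writerow([row[name] for name in OUTPUT_COLUMNS])
        written += 1
    return written


# Invented demo row; the extra obs_lof column is dropped on output.
demo_row = {name: "." for name in SELECTED_COLUMNS}
demo_row.update({
    "gene": "GENE_A", "chromosome": "1",
    "start_position": "1000", "end_position": "2000", "obs_lof": "3",
})

out = io.StringIO()
n_written = write_region_db([demo_row], out)
db_text = out.getvalue()
print(len(OUTPUT_COLUMNS), n_written)
```

For the real table, the rows would come from `csv.DictReader(handle, delimiter="\t")` over the by_gene file, and `out_handle` would be an open "hg19_LOEUF.txt".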
Modifying the ANNOVAR annotation script:
Thanks to TSY for the earlier attempts and for sharing the experience. Add the following elsif branch at line 3084 of annotate_variation.pl:
Run the annotation:
$annovar/convert2annovar.pl -format vcf4 sample.vcf > sample.avinput
$annovar/table_annovar.pl sample.avinput $humandb --buildver hg19 -out sample_anno_LOEUF -remove -protocol refGene,ensGene,LOEUF -operation g,g,r --nastring . -csvout
refGene and ensGene are annotated in the same run so that I can check whether the gene names and IDs in the results all match.
The result can then be split into separate columns as needed. If you are interested in other columns and want to annotate them too, adapt the steps above.
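Splitting the merged LOEUF column back into named fields could look like the sketch below. It rests on assumptions that you should check against your own output: the delimiter inside the merged column depends on how the annotation script joins the values, so it is a required argument rather than a default, and the sketch assumes the merged column holds exactly the nine selected fields in the recorded order. `split_loeuf_field` and the example string are hypothetical.

```python
# Recorded order of the nine annotation fields in hg19_LOEUF.txt.
ANNOTATION_COLUMNS = [
    "gene", "gene_id", "transcript", "oe_lof_upper_bin",
    "oe_lof_upper_rank", "pLI", "pRec", "pNull", "transcript_level",
]


def split_loeuf_field(value, delimiter, names=ANNOTATION_COLUMNS, na_string="."):
    """Split one merged annotation value into a dict keyed by column name.

    delimiter is an assumption to verify against real output.
    A bare na_string (ANNOVAR's --nastring) means "no annotation".
    """
    if value == na_string:
        return {name: na_string for name in names}
    parts = value.split(delimiter)
    if len(parts) != len(names):
        raise ValueError(
            f"expected {len(names)} fields, got {len(parts)}: {value!r}"
        )
    return dict(zip(names, parts))


# Invented merged value, joined with a made-up "|" delimiter.
merged = "GENE_A|ENSG_A|ENST_A1|0|12|0.99|0.01|0.00|2"
fields = split_loeuf_field(merged, delimiter="|")
print(fields["oe_lof_upper_bin"], fields["pLI"])
```

Raising on a field-count mismatch is deliberate: a silent misalignment would shift every value into the wrong column.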
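To the Wang Lab folks:
ANNOVAR is best suited to annotating individual variant sites. This region-based annotation only covers the intragenic regions of these genes, such as coding and intronic regions, so if your SNPs fall in intergenic regions they probably cannot be annotated directly. In that case you can go back to the original table and match your genes of interest against it directly, looking mainly at the "oe_lof_upper_bin" column.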
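A direct gene lookup against the original table, as suggested above, might be sketched like this. The function name and the demo rows are invented, and the sketch assumes one row per gene, as in the by_gene table.

```python
def loeuf_bins_for_genes(rows, genes_of_interest):
    """Return ({gene: oe_lof_upper_bin}, sorted list of genes not found)."""
    wanted = set(genes_of_interest)
    found = {}
    for row in rows:
        if row["gene"] in wanted:
            found[row["gene"]] = row["oe_lof_upper_bin"]
    missing = sorted(wanted - set(found))
    return found, missing


# Invented rows standing in for the parsed by_gene table.
demo_rows = [
    {"gene": "GENE_A", "oe_lof_upper_bin": "0"},
    {"gene": "GENE_B", "oe_lof_upper_bin": "7"},
]
found, missing = loeuf_bins_for_genes(demo_rows, ["GENE_A", "GENE_Z"])
print(found, missing)
```

Returning the missing genes separately makes it easy to spot naming mismatches, for example an alias used in place of the gene symbol in the table.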