Common methods for normalizing GWAS phenotypes include quantile normalization, inverse rank normalization, and Z-score normalization.
They differ as follows:
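I. Quantile normalization
This method sorts the phenotype values and maps them onto a reference distribution, usually the normal distribution. Each observation is assigned the value found at the same quantile of the target distribution, so that the same proportion of observations falls below, at, or above each value as in the target. This ensures that the phenotype distribution is consistent across all samples.
Advantages: it removes the influence of extreme values and outliers caused by skewness, and it works reasonably well on small batches of data.
Papers that use this method for phenotype normalization: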
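The mapping described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used by any of the cited papers: the function name `quantile_normalize`, the mid-point plotting position `(position + 0.5) / n`, and the example values are all assumptions made here for demonstration.

```python
from statistics import NormalDist


def quantile_normalize(values, target=NormalDist(0.0, 1.0)):
    """Map each value to the target-distribution quantile at the same rank position.

    Assumption: ties are broken by input order in this sketch.
    """
    n = len(values)
    # Indices of the observations ordered from smallest to largest value.
    order = sorted(range(n), key=lambda i: values[i])
    out = [0.0] * n
    for position, i in enumerate(order):
        # Mid-point plotting position keeps the probability strictly inside (0, 1).
        p = (position + 0.5) / n
        out[i] = target.inv_cdf(p)
    return out


# Hypothetical right-skewed phenotype with one extreme value.
phenotype = [1.2, 0.8, 1.5, 0.9, 25.0]
normalized = quantile_normalize(phenotype)
```

The extreme value 25.0 keeps its rank (it is still the largest) but no longer dominates the scale, which is the outlier-damping effect mentioned above.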
1. Genome-wide association studies of brain imaging phenotypes in UK Biobank[J]. Nature, 2018, 562(7726): 210-216.
To ameliorate this, we quantile-normalized each of the image-derived phenotypes (IDPs) before association testing. This transformation also helped to avoid undue influence of outlier values.
https://www.nature.com/articles/s41586-018-0571-7
2. A multiple-phenotype imputation method for genetic studies[J]. Nature genetics, 2016, 48(4): 466-472.
Traits were mean and variance standardized and quantile normalized before analysis.
https://www.nature.com/articles/ng.3513
3. Components of genetic associations across 2,138 phenotypes in the UK Biobank highlight adipocyte biology[J]. Nature communications, 2019, 10(1): 4064.
We applied quantile normalization for phenotype (--pheno-quantile-normalize option), where we fit a linear model with covariates and transform the phenotypes to normal distribution N(0,1).
https://www.nature.com/articles/s41467-019-11953-9
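II. Inverse rank normalization
This method transforms data to a normal distribution. It sorts the values by magnitude, converts each rank to a percentile (its relative position among all values), and then uses the inverse of the standard normal cumulative distribution function to turn that percentile into a z-score. Higher-ranked (larger) values therefore map to larger normal scores, and lower-ranked values map to smaller ones. The method is suited to datasets with many outliers or a non-normal distribution: it reshapes the data into an approximately normal distribution, which simplifies downstream statistical analysis.
Papers that use this method for phenotype normalization: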
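As a rough sketch of the rank, percentile, z-score pipeline described above, the following Python function averages the ranks of tied values and converts rank r to the percentile (r - c) / (n - 2c + 1). The offset c = 3/8 (the Blom offset) is an assumption chosen here; the text does not say which offset to use, and the function name and sample values are hypothetical.

```python
from statistics import NormalDist


def inverse_rank_normalize(values, c=3.0 / 8.0):
    """Rank-based inverse normal transform with average ranks for ties."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    start = 0
    while start < n:
        # Find the run of tied values beginning at `start`.
        end = start
        while end + 1 < n and values[order[end + 1]] == values[order[start]]:
            end += 1
        # Tied observations share the average of their 1-based ranks.
        avg_rank = (start + end) / 2.0 + 1.0
        for k in range(start, end + 1):
            ranks[order[k]] = avg_rank
        start = end + 1
    std_normal = NormalDist()
    # Rank -> percentile in (0, 1) -> z-score via the inverse normal CDF.
    return [std_normal.inv_cdf((r - c) / (n - 2.0 * c + 1.0)) for r in ranks]


# Hypothetical phenotype with a tie and one outlier.
phenotype = [3.1, 0.4, 0.4, 7.9, 120.0]
z = inverse_rank_normalize(phenotype)
```

Only the ordering of the input matters: replacing 120.0 with 1200.0 would give exactly the same output, which is why the transform is robust to outliers.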
Association analyses of 249,796 individuals reveal 18 new loci associated with body mass index[J]. Nature genetics, 2010, 42(11): 937-948.
BMI was adjusted for age, age² and other appropriate covariates (for example, principal components) and inverse normally transformed to a mean of 0 and a standard deviation of 1.
https://www.nature.com/articles/ng.686
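III. Z-score normalization
This method standardizes the observed phenotype values across all samples by computing a Z-score for each sample. The Z-score measures how many standard deviations a given sample's phenotype lies from the mean phenotype of all samples. This makes it possible to compare phenotypes measured in different units or on different scales.
Papers that use this method for phenotype normalization: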
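A minimal Python sketch of this calculation follows. The function name and the height values are hypothetical, and using the sample standard deviation (n - 1 denominator) is an assumption; some pipelines use the population standard deviation instead.

```python
from statistics import mean, stdev


def z_score_normalize(values):
    """Subtract the mean and divide by the sample standard deviation."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation (n - 1 denominator)
    return [(v - mu) / sigma for v in values]


# Hypothetical heights recorded in centimetres, then the same heights in metres.
height_cm = [150.0, 160.0, 170.0, 180.0, 190.0]
height_m = [h / 100.0 for h in height_cm]

z_cm = z_score_normalize(height_cm)
z_m = z_score_normalize(height_m)
```

Both inputs produce the same Z-scores up to floating-point rounding, which illustrates why the result does not depend on the unit of measurement. Unlike the rank-based methods, the Z-score is a linear rescaling, so it preserves the shape of the distribution, including any skew or outliers.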
A genome-wide association study in 19 633 Japanese subjects identified LHX3-QSOX2 and IGF1 as adult height loci[J]. Human molecular genetics, 2010, 19(11): 2303-2312.
The scores were then normalized as Z scores. The effects of the Z scores on height were evaluated using the multivariate linear regression model incorporating height as a dependent variable and the Z scores, gender and age as the independent variables, using R statistical software. Differences in height between the subjects with low Z scores (less than or equal to −2) and high Z scores (≥2) were obtained by comparing the means of the non-adjusted height between subject groups.
https://academic.oup.com/hmg/article-abstract/19/11/2303/579594
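Summary
Quantile normalization suits skewed distributions or data with many outliers; inverse rank normalization is more accurate when the sample size is small; Z-score normalization suits phenotypes measured in different units or on different scales, because it makes them comparable.
From the literature I have reviewed, quantile normalization appears to be the more common choice for normalizing continuous phenotypes in GWAS.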