Analysis workflow: Polygenic Risk Score (PRS)

sudo apt-get install zlib1g zlib1g-dev libblas3 libgfortran5 liblapack3 libquadmath0 plink1.9 unzip

sudo apt install dirmngr gnupg apt-transport-https ca-certificates software-properties-common

sudo apt install r-base

1. Obtain or generate the base data

The first step of a Polygenic Risk Score (PRS) analysis is to obtain the base data, i.e. GWAS summary statistics, which should contain every trait-associated allele together with its effect size. The usual fields are listed below (a minimal sketch for checking these fields in a downloaded file follows the list).
CHR: The chromosome in which the SNP resides
BP: Chromosomal co-ordinate of the SNP (may also appear as POSITION)
SNP: SNP ID, usually in the form of an rs-ID (the file used here names this column RSID)
A1: The effect allele of the SNP (may also appear as REF)
A2: The non-effect allele of the SNP (may also appear as ALT)
N: Number of samples used to obtain the effect size estimate
SE: The standard error (SE) of the effect size estimate
P: The P-value of association between the SNP genotypes and the base phenotype (may also appear as p_value)
OR: The effect size estimate of the SNP if the outcome is binary/case-control (an odds ratio); if the outcome is continuous or treated as continuous this is usually reported as BETA
INFO: The imputation information score (may also appear as INFO.plink)
MAF: The minor allele frequency (MAF) of the SNP (may also appear as ALT_FREQ or FRQ)
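
A quick way to confirm which of these fields a downloaded summary-statistics file actually contains is sketched below. This is a minimal check only; the file name is hypothetical and the expected column names should be adapted to your file's header.

import pandas as pd

# hypothetical file name; use your own GWAS summary-statistics file
stats = pd.read_csv("base_gwas.txt", sep="\t", low_memory=False)

expected = ["CHR", "BP", "SNP", "A1", "A2", "N", "SE", "P", "OR", "INFO", "MAF"]
print("columns in file:", list(stats.columns))
print("expected fields missing:", [c for c in expected if c not in stats.columns])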

I downloaded someone else's GWAS results from the internet as a TXT file. The goal is to rearrange the columns from the layout shown in the first line below into the layout shown in the second line:

Chromosome  Position    RSID    REF ALT ALT_FREQ    ALT_FREQ_1KGASN RSQ INFO    HWE_P   Pvalue  Qvalue  N   NullLogLike AltLogLike  SNPWeight   SNPWeightSE OddsRatio   WaldStat    NullLogDelta    NullGeneticVar  NullResidualVar NullBias
CHR BP  RSID    A1  A2  N   SE  P   OR  info    MAF 
1.1 Rearranging the columns

First convert the TXT file to CSV, then use Python to extract and reorder the columns, and finally convert the CSV back to TXT.

TXT to CSV

# -*- coding: utf-8 -*-
"""
Created on Wed Jul 13 11:25:52 2022

@author: Bohan
"""

import pandas as pd
df = pd.read_csv("MDD.10640samples.dosages.hwe6info9maf5.logreg.2017.txt",delimiter="\t", low_memory=False)
df.to_csv("MDD.10640samples.dosages.hwe6info9maf5.logreg.2017.csv", encoding='utf-8', index=False)

Extract and reorder the columns of the CSV

# -*- coding: utf-8 -*-
"""
Created on Wed Jul 13 10:49:59 2022

@author: Bohan
"""

import csv

with open("MD_GWAS_SNPresults-original.csv") as f, open("originalarranged.csv","w",newline='') as tmp:
    r = csv.reader(f)
    wr = csv.writer(tmp)
    wr.writerows([a,b,c,d,e,m,q,k,r,i,f] for [a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w] in r)
#[a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w] 是原來文件的列順序
#[a,b,c,d,e,m,q,k,r,i,f]是按照原順序提取重排

CSV to TXT

# -*- coding: utf-8 -*-
"""
Created on Wed Jul 13 12:07:58 2022

@author: Bohan
"""

import csv
 
 
with open('MD_GWAS_SNPresults-original.arranged.csv') as src, \
     open('MD_GWAS_SNPresults-original.arranged', 'w') as out:
    for row in csv.reader(src):
        # join each row with tabs so the result is a tab-delimited TXT file
        out.write('\t'.join(row) + '\n')

After this, the base data looks like this:

CHR BP  RSID    A1  A2  N   SE  P   OR  info    MAF 
10  90127   rs185642176 C   T   10640   0.004802742809305182    0.4704864679671288  1.0123502866538827  0.905519504189  0.046   
10  90164   rs141504207 C   G   10640   0.004802742809305182    0.4704864679671288  1.0123502866538827  0.905515996565  0.046   
10  94026   rs10904032  G   A   10640   0.004807929547222263    0.9907936601472477  1.000113527128114   0.946752744368  0.145   

1.2 QC of the base data

Decompress and read MD_GWAS_SNPresults-original.2015.arranged.txt.gz
Keep the header line (NR==1)
Keep rows with MAF greater than 0.01 (column 11 is MAF)
Keep rows with INFO greater than 0.8 (column 10 is INFO)
Compress the result to MD2015.gz

gunzip -c MD_GWAS_SNPresults-original.2015.arranged.txt.gz |\
awk 'NR==1 || ($11 > 0.01) && ($10 > 0.8) {print}' |\
gzip  > MD2015.gz

Remove duplicates based on column 3 (the SNP ID) to obtain MD2015.nodup.gz (no duplicates)

gunzip -c MD2015.gz |\
awk '{seen[$3]++; if(seen[$3]==1){ print}}' |\
gzip - > MD2015.nodup.gz

Remove ambiguous SNPs based on the alleles in columns 4 and 5

gunzip -c MD2015.nodup.gz |\
awk '!( ($4=="A" && $5=="T") || \
        ($4=="T" && $5=="A") || \
        ($4=="G" && $5=="C") || \
        ($4=="C" && $5=="G")) {print}' |\
    gzip > MD2015.QC.gz

2. QC of the target data

This example uses the raw files (*.idat) produced by the Illumina ASA v1 SNP array.
Download Illumina's GenomeStudio software directly:
https://files.softwaredownloads.illumina.com/5831d9df-95cb-4427-a7c0-499fe871e1d5/genomestudio-software-v2-0-5-0-installer.zip

2.1 Create the analysis project

Workflow

1st Step

You will also need the Infinium Asian Screening Array v1.0 Manifest File (BPM Format - GRCh37)

and the Infinium Asian Screening Array v1.0 Cluster File

2.2 Evaluate the controls

Files needed
After creating the project and computing the statistics, start the controls evaluation.

Evaluating the control data
The array has built-in control markers
The control data indicate whether the assay itself ran properly

2.3 Evaluate samples and SNPs


For the underlying theory, see: GSA/ASA芯片質(zhì)控 - 簡書 (jianshu.com)
Sample evaluation
Call Rate (sample call rate): the proportion of assayed SNPs that are successfully genotyped in a sample; the usual threshold is 90% or higher (a small sketch of this arithmetic follows after this list).
LogR Dev: used to assess whether a sample is contaminated.
SNP evaluation
Call Frequency: the proportion of samples in which a SNP is successfully called.
GenTrain Score: Illumina's own clustering-quality score.
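
As a rough illustration of the call-rate arithmetic only (not GenomeStudio's implementation), assuming a toy genotype matrix in which missing calls are coded as NaN:

import numpy as np

# toy genotype matrix: rows = samples, columns = SNPs; np.nan marks a failed call
genotypes = np.array([
    [0, 1, 2, np.nan, 1],
    [np.nan, np.nan, 2, 1, 0],
])

# sample call rate = successfully called SNPs / SNPs assayed for that sample
call_rate = 1 - np.isnan(genotypes).mean(axis=1)
print(call_rate)  # samples below 0.90 would typically be discarded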

Sample evaluation: discard samples whose Call Rate is too low.

SNP evaluation: filter SNPs on Call Frequency and GenTrain Score.

The basic workflow is: create the project, add the manifest information (using the reference genome build that matches the GWAS), and use the bundled plugin plink-input-report-plugin-v2-1-4 to export .ped and .map files.

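# convert the exported .ped/.map into binary plink format (.bed/.bim/.fam)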
plink1.9 --file new --out new --make-bed
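# basic QC: keep SNPs with MAF >= 0.01, HWE p > 1e-6 and missingness < 2%; drop samples with
# missingness > 2%; write the passing SNP list and a .fam file of the remaining samples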
plink1.9 --bfile new --maf 0.01 --hwe 1e-6 --geno 0.02 --mind 0.02 --write-snplist --make-just-fam --out new.QC
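# LD pruning on the QC-passing SNPs/samples: 200-variant window, step 50, r^2 threshold 0.25
# (writes new.QC.prune.in / new.QC.prune.out)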
plink1.9 --bfile new --keep new.QC.fam --extract new.QC.snplist --indep-pairwise 200 50 0.25 --out new.QC
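# compute per-sample heterozygosity (F coefficient) on the pruned SNP set (writes new.QC.het)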
plink1.9 --bfile new --extract new.QC.prune.in --keep new.QC.fam --het --out new.QC
sudo R
install.packages("data.table")
library(data.table)
dat <- fread("new.QC.het")
valid <- dat[F<=mean(F)+3*sd(F) & F>=mean(F)-3*sd(F)] 
fwrite(valid[,c("FID","IID")], "new.valid.sample", sep="\t") 
install.packages("magrittr")
install.packages("R.utils")
library(magrittr)
bim <- fread("new.bim") %>%
     setnames(., colnames(.), c("CHR", "SNP", "CM", "BP", "B.A1", "B.A2")) %>%
    .[,c("B.A1","B.A2"):=list(toupper(B.A1), toupper(B.A2))]
MD <- fread("MD2015.QC.gz") %>%
.[,c("A1","A2"):=list(toupper(A1), toupper(A2))]
qc <- fread("new.QC.snplist", header=F)
info <- merge(bim, MD, by=c("SNP", "CHR", "BP")) %>%
    .[SNP %in% qc[,V1]]

complement <- function(x){
    switch (x,
        "A" = "T",
        "C" = "G",
        "T" = "A",
        "G" = "C",
        return(NA)
    )
}

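# SNPs whose effect/non-effect alleles already agree between the target .bim and the base data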
info.match <- info[A1 == B.A1 & A2 == B.A2, SNP]
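# SNPs that agree only after flipping the target alleles to the complementary strand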
com.snps <- info[sapply(B.A1, complement) == A1 &
                    sapply(B.A2, complement) == A2, SNP]

bim[SNP %in% com.snps, c("B.A1", "B.A2") :=
        list(sapply(B.A1, complement),
            sapply(B.A2, complement))]

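# SNPs whose effect and non-effect alleles are swapped relative to the base data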
recode.snps <- info[B.A1==A2 & B.A2==A1, SNP]

bim[SNP %in% recode.snps, c("B.A1", "B.A2") :=
        list(B.A2, B.A1)]

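# SNPs that are both strand-flipped and have swapped effect/non-effect alleles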
com.recode <- info[sapply(B.A1, complement) == A2 &
                    sapply(B.A2, complement) == A1, SNP]

bim[SNP %in% com.recode, c("B.A1", "B.A2") :=
        list(sapply(B.A2, complement),
            sapply(B.A1, complement))]
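# write the updated effect allele for every SNP (the EUR.a1 file name is kept as in the original code)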
fwrite(bim[,c("SNP", "B.A1")], "EUR.a1", col.names=F, sep="\t")

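# SNPs that fall into none of the above categories are treated as mismatches and listed for exclusion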
mismatch <- bim[!(SNP %in% info.match |
                    SNP %in% com.snps |
                    SNP %in% recode.snps |
                    SNP %in% com.recode), SNP]
write.table(mismatch, "EUR.mismatch", quote=F, row.names=F, col.names=F)
q()


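# sex check: compare the sex recorded in the .fam with X-chromosome homozygosity estimated
# from the pruned SNPs (writes new.QC.sexcheck)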
plink1.9 --bfile new --extract new.QC.prune.in --keep new.valid.sample --check-sex --out new.QC

valid <- fread("new.valid.sample")
dat <- fread("new.QC.sexcheck")[FID%in%valid$FID]
fwrite(dat[STATUS=="OK",c("FID","IID")], "new.QC.valid", sep="\t") 
q() # exit R

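# relatedness filter: drop one sample from each pair with estimated relatedness above 0.125
# (writes new.QC.rel.id listing the samples kept)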
plink1.9 --bfile new  --extract new.QC.prune.in --keep new.QC.valid --rel-cutoff 0.125 --out new.QC

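# generate the final QC-ed target data: unrelated, QC-passing samples and SNPs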
plink1.9 --bfile new --make-bed --keep new.QC.rel.id --out new.QC --extract new.QC.snplist


-----------------------------------------------------------------------------

library(data.table)
dat <- fread("MD2015.QC.gz")
fwrite(dat[,BETA:=log(OR)], "new.QC.Transformed", sep="\t")
q() # exit R


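# clumping: keep the most significant SNP per LD block (SNPs with r^2 > 0.1 within 250 kb are
# assigned to that index SNP); --clump-p1 1 lets every SNP be considered as an index SNP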
plink1.9  --bfile new.QC  --clump-p1 1 --clump-r2 0.1 --clump-kb 250 --clump new.QC.Transformed --clump-snp-field SNP --clump-field P --out new

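# column 3 of new.clumped holds the IDs of the index SNPs retained after clumping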
awk 'NR!=1{print $3}' new.clumped >  new.valid.snp

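# build a SNP-ID / P-value file for --q-score-range (columns 3 and 8 of the transformed base data)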
awk '{print $3,$8}' new.QC.Transformed > new.pvalue

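# each range_list line is: label, lower P-value bound, upper P-value bound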
echo "0.001 0 0.001" > range_list 
echo "0.05 0 0.05" >> range_list
echo "0.1 0 0.1" >> range_list
echo "0.2 0 0.2" >> range_list
echo "0.3 0 0.3" >> range_list
echo "0.4 0 0.4" >> range_list
echo "0.5 0 0.5" >> range_list


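# --score columns: 3 = SNP ID (RSID), 4 = effect allele (A1), 12 = BETA (appended as the 12th
# column in the R step above); one new.<label>.profile file is produced per range_list threshold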
plink1.9 --bfile new.QC --score new.QC.Transformed 3 4 12 header  --q-score-range range_list new.pvalue --extract new.valid.snp --out new


# First, we need to perform pruning
plink1.9 --bfile new.QC --indep-pairwise 200 50 0.25 --out new
# Then we calculate the first 6 PCs
plink1.9 --bfile new.QC --extract new.prune.in --pca 6 --out new

library(data.table)
library(magrittr)
p.threshold <- c(0.001,0.05,0.1,0.2,0.3,0.4,0.5)
phenotype <- fread("new.phenotype")
pcs <- fread("new.eigenvec", header=F) %>%
    setnames(., colnames(.), c("FID", "IID", paste0("PC",1:6)) )
covariate <- fread("new.cov")
pheno <- merge(phenotype, covariate) %>%
        merge(., pcs)
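# R2 of the null model (covariates and PCs only, no PRS)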
null.r2 <- summary(lm(Trait1~., data=pheno[,-c("FID", "IID")]))$r.squared
prs.result <- NULL
for(i in p.threshold){
    pheno.prs <- paste0("new.", i, ".profile") %>%
        fread(.) %>%
        .[,c("FID", "IID", "SCORE")] %>%
        merge(., pheno, by=c("FID", "IID"))

    model <- lm(Trait1~., data=pheno.prs[,-c("FID","IID")]) %>%
            summary
    model.r2 <- model$r.squared
    prs.r2 <- model.r2-null.r2
    prs.coef <- model$coeff["SCORE",]
    prs.result %<>% rbind(.,
        data.frame(Threshold=i, R2=prs.r2, 
                    P=as.numeric(prs.coef[4]), 
                    BETA=as.numeric(prs.coef[1]),
                    SE=as.numeric(prs.coef[2])))
}
print(prs.result[which.max(prs.result$R2),])
q() # exit R

The same threshold loop can also be written in base R (without data.table):

p.threshold <- c(0.001,0.05,0.1,0.2,0.3,0.4,0.5)
# Read in the phenotype file
phenotype <- read.table("new.phenotype", header=T)
# Read in the PCs
pcs <- read.table("new.eigenvec", header=F)
# The default output from plink does not include a header,
# so we add the appropriate headers (1:6 because there are 6 PCs)
colnames(pcs) <- c("FID", "IID", paste0("PC",1:6))
# Read in the covariates (here, it is sex)
covariate <- read.table("new.cov", header=T)
# Now merge the files
pheno <- merge(merge(phenotype, covariate, by=c("FID", "IID")), pcs, by=c("FID","IID"))
# We can then calculate the null model (model without PRS) using a linear regression
# (as the trait is treated as quantitative)
null.model <- lm(Trait1~., data=pheno[,!colnames(pheno)%in%c("FID","IID")])
# And the R2 of the null model is
null.r2 <- summary(null.model)$r.squared
prs.result <- NULL
for(i in p.threshold){
    # Go through each p-value threshold
    prs <- read.table(paste0("new.",i,".profile"), header=T)
    # Merge the PRS with the phenotype matrix; we only need FID, IID and SCORE from the PRS file
    pheno.prs <- merge(pheno, prs[,c("FID","IID","SCORE")], by=c("FID", "IID"))
    # Linear regression of the trait on the PRS and the covariates, ignoring FID and IID
    model <- lm(Trait1~., data=pheno.prs[,!colnames(pheno.prs)%in%c("FID","IID")])
    # model R2 is obtained as
    model.r2 <- summary(model)$r.squared
    # R2 of the PRS is simply the model R2 minus the null R2
    prs.r2 <- model.r2-null.r2
    # We can also obtain the coefficient and p-value of association of the PRS as follows
    prs.coef <- summary(model)$coeff["SCORE",]
    prs.beta <- as.numeric(prs.coef[1])
    prs.se <- as.numeric(prs.coef[2])
    prs.p <- as.numeric(prs.coef[4])
    # Store the results
    prs.result <- rbind(prs.result, data.frame(Threshold=i, R2=prs.r2, P=prs.p, BETA=prs.beta, SE=prs.se))
}
# Best result is:
prs.result[which.max(prs.result$R2),]
q() # exit R
