分析流程-多基因風險分數(shù) PRS( Polygenic risk score)

sudo apt-get install zlib1g zlib1g.dev libblas3 libgfortran5 liblapack3 libquadmath0 plink1.9 unzip

sudo apt install dirmngr gnupg apt-transport-https ca-certificates software-properties-common

sudo apt install r-base

1.獲取或者生成基礎(chǔ)數(shù)據(jù)（base data）

Polygenic Risk Score (PRS) 分析第一步就是獲得基礎(chǔ)數(shù)據(jù)（即GWAS統(tǒng)計分析結(jié)果）瀑粥，應(yīng)該包含了與性狀相關(guān)的所有等位基因信息及對應(yīng)效應(yīng)貢獻.
CHR: The chromosome in which the SNP resides---位于第幾條染色體
BP: Chromosomal co-ordinate of the SNP---位于染色體物理位置挣跋，可能出現(xiàn)的形式：POSITION
SNP: SNP ID, usually in the form of rs-ID---SNP編號，通常是rsID
A1: The effect allele of the SNP---效應(yīng)等位基因狞换，可能出現(xiàn)的形式：REF
A2: The non-effect allele of the SNP---非效應(yīng)等位基因避咆，可能出現(xiàn)的形式：ALT
N: Number of samples used to obtain the effect size estimate---用于評估效量值的群體數(shù)量
SE: The standard error (SE) of the effect size esimate---所評估效應(yīng)量值的標準誤差
P: The P-value of association between the SNP genotypes and the base phenotype---所評估表型和基因型的相關(guān)性p值，可能出現(xiàn)的形式：p_value
OR: The effect size estimate of the SNP, if the outcome is binary/case-control. If the outcome is continuous or treated as continuous then this will usually be BETA---所評基因型的效應(yīng)量修噪，Odds Ratio或 Effect Size
INFO: The imputation information score---插補得分查库，可能出現(xiàn)的形式：INFO.plink
MAF: The minor allele frequency (MAF) of the SNP ---所評基因型的效應(yīng)量,可能出現(xiàn)的形式：ALT_FREQ、FRQ

我是從網(wǎng)上下載了別人的GWAS結(jié)果黄琼，是個TXT樊销，目的是要將數(shù)據(jù)從以下排序換成第二行所示：

Chromosome  Position    RSID    REF ALT ALT_FREQ    ALT_FREQ_1KGASN RSQ INFO    HWE_P   Pvalue  Qvalue  N   NullLogLike AltLogLike  SNPWeight   SNPWeightSE OddsRatio   WaldStat    NullLogDelta    NullGeneticVar  NullResidualVar NullBias

CHR BP  RSID    A1  A2  N   SE  P   OR  info    MAF

1.1 數(shù)據(jù)排序

首先把TXT轉(zhuǎn)換成CSV，再用PYTHON提取排序后再把CSV轉(zhuǎn)換成TXT

TXT轉(zhuǎn)CSV

# -*- coding: utf-8 -*-
"""
Created on Wed Jul 13 11:25:52 2022

@author: Bohan
"""

import pandas as pd
df = pd.read_csv("MDD.10640samples.dosages.hwe6info9maf5.logreg.2017.txt",delimiter="\t", low_memory=False)
df.to_csv("MDD.10640samples.dosages.hwe6info9maf5.logreg.2017.csv", encoding='utf-8', index=False)

CSV提取排序

# -*- coding: utf-8 -*-
"""
Created on Wed Jul 13 10:49:59 2022

@author: Bohan
"""

import csv

with open("MD_GWAS_SNPresults-original.csv") as f, open("originalarranged.csv","w",newline='') as tmp:
    r = csv.reader(f)
    wr = csv.writer(tmp)
    wr.writerows([a,b,c,d,e,m,q,k,r,i,f] for [a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w] in r)
#[a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w] 是原來文件的列順序
#[a,b,c,d,e,m,q,k,r,i,f]是按照原順序提取重排

CSV轉(zhuǎn)TXT

# -*- coding: utf-8 -*-
"""
Created on Wed Jul 13 12:07:58 2022

@author: Bohan
"""

import csv
 
 
a=open('MD_GWAS_SNPresults-original.arranged.csv','r')
reader = csv.reader(a)
 
with open('MD_GWAS_SNPresults-original.arranged','w') as f:
    
    for i in reader:
        for x in i:
            f.write(x)
            f.write('\t')
        f.write('\n')
a.close()

搞完base數(shù)據(jù)看起來這樣

CHR BP  RSID    A1  A2  N   SE  P   OR  info    MAF 
10  90127   rs185642176 C   T   10640   0.004802742809305182    0.4704864679671288  1.0123502866538827  0.905519504189  0.046   
10  90164   rs141504207 C   G   10640   0.004802742809305182    0.4704864679671288  1.0123502866538827  0.905515996565  0.046   
10  94026   rs10904032  G   A   10640   0.004807929547222263    0.9907936601472477  1.000113527128114   0.946752744368  0.145

1.2 base數(shù)據(jù)質(zhì)檢

解壓讀取MD_GWAS_SNPresults-original.2015.arranged.txt.gz
輸出文件頭 (NR==1)
輸出MAF大于0.01的數(shù)據(jù)行 (第11列是MAF)
輸出INFO大于0.8 的數(shù)據(jù)行(第10列是INFO)
壓縮結(jié)果到MD2015.gz

gunzip -c MD_GWAS_SNPresults-original.2015.arranged.txt.gz |\
awk 'NR==1 || ($11 > 0.01) && ($10 > 0.8) {print}' |\
gzip  > MD2015.gz

根據(jù)第3列的數(shù)據(jù)去重獲得MD2015.nodup.gz（No-duplicate）

gunzip -c MD2015.gz |\
awk '{seen[$3]++; if(seen[$3]==1){ print}}' |\
gzip - > MD2015.nodup.gz

根據(jù)第四第五行數(shù)值去除Ambiguous SNPs

gunzip -c MD2015.nodup.gz |\
awk '!( ($4=="A" && $5=="T") || \
        ($4=="T" && $5=="A") || \
        ($4=="G" && $5=="C") || \
        ($4=="C" && $5=="G")) {print}' |\
    gzip > MD2015.QC.gz

2.目標數(shù)據(jù)（target data）的質(zhì)檢QC

本案例用的是illumina ASA v1 SNP芯片檢出的原始文件（*.idat文件）
直接去下載illumina的genomestudio軟件
https://files.softwaredownloads.illumina.com/5831d9df-95cb-4427-a7c0-499fe871e1d5/genomestudio-software-v2-0-5-0-installer.zip

2.1 創(chuàng)建分析項目

Workflow

1st Step

要用的還有Infinium Asian Screening Array v1.0 Manifest File (BPM Format - GRCh37)

Infinium Asian Screening Array v1.0 Cluster File

2.2 評估參照數(shù)據(jù)

要用的文件

創(chuàng)建完project和統(tǒng)計后，開始control評估

評估參考數(shù)據(jù)

內(nèi)建有參考marker

參照數(shù)據(jù)可以用來判斷實驗體系是否正常運行

2.3 評估樣品和SNP

image.png

具體理論可以參考這個GSA/ASA芯片質(zhì)控 - 簡書 (jianshu.com)
樣品評估
Call Rate（檢出率）樣本檢出率：是指對于某種樣本而言丈氓，通過測序并成功判刑的snp與所有檢出的snp的比值湾笛，通常標準在90%或以上嚎研。
LogR Dev用于評估是否有樣品污染
SNP評估
Call Frequency（檢出頻率）用于SNP檢出的樣品覆蓋程度
GenTrain Scoreillumina自己的算法

image.png

樣品評估
Call Rate太低的不要

image.png

SNP評估

image.png

基本流程就是創(chuàng)建樣品-添加manifest信息蚜退，參考基因組按照GWAS對應(yīng)版本钻注，利用自帶插件plink-input-report-plugin-v2-1-4到處.ped和.map文件

plink1.9 --file new --out new --make-bed

plink1.9 --bfile new --maf 0.01 --hwe 1e-6 --geno 0.02 --mind 0.02 --write-snplist --make-just-fam --out new.QC

plink1.9 --bfile new --keep new.QC.fam --extract new.QC.snplist --indep-pairwise 200 50 0.25 --out new.QC

plink1.9 --bfile new --extract new.QC.prune.in --keep new.QC.fam --het --out new.QC

sudo R
install.packages("data.table")
library(data.table)
dat <- fread("new.QC.het")
valid <- dat[F<=mean(F)+3*sd(F) & F>=mean(F)-3*sd(F)] 
fwrite(valid[,c("FID","IID")], "new.valid.sample", sep="\t")

install.packages("magrittr")
install.packages("R.utils")
library(magrittr)
bim <- fread("new.bim") %>%
     setnames(., colnames(.), c("CHR", "SNP", "CM", "BP", "B.A1", "B.A2")) %>%
    .[,c("B.A1","B.A2"):=list(toupper(B.A1), toupper(B.A2))]
MD <- fread("MD2015.QC.gz") %>%
.[,c("A1","A2"):=list(toupper(A1), toupper(A2))]
qc <- fread("new.QC.snplist", header=F)

info <- merge(bim, MD, by=c("SNP", "CHR", "BP")) %>%
    .[SNP %in% qc[,V1]]

complement <- function(x){
    switch (x,
        "A" = "T",
        "C" = "G",
        "T" = "A",
        "G" = "C",
        return(NA)
    )

info.match <- info[A1 == B.A1 & A2 == B.A2, SNP]
com.snps <- info[sapply(B.A1, complement) == A1 &
                    sapply(B.A2, complement) == A2, SNP]

bim[SNP %in% com.snps, c("B.A1", "B.A2") :=
        list(sapply(B.A1, complement),
            sapply(B.A2, complement))]

recode.snps <- info[B.A1==A2 & B.A2==A1, SNP]

bim[SNP %in% recode.snps, c("B.A1", "B.A2") :=
        list(B.A2, B.A1)]

com.recode <- info[sapply(B.A1, complement) == A2 &
                    sapply(B.A2, complement) == A1, SNP]

bim[SNP %in% com.recode, c("B.A1", "B.A2") :=
        list(sapply(B.A2, complement),
            sapply(B.A1, complement))]
fwrite(bim[,c("SNP", "B.A1")], "EUR.a1", col.names=F, sep="\t")

mismatch <- bim[!(SNP %in% info.match |
                    SNP %in% com.snps |
                    SNP %in% recode.snps |
                    SNP %in% com.recode), SNP]
write.table(mismatch, "EUR.mismatch", quote=F, row.names=F, col.names=F)
q()


plink1.9 --bfile new --extract new.QC.prune.in --keep new.valid.sample --check-sex --out new.QC

valid <- fread("new.valid.sample")
dat <- fread("new.QC.sexcheck")[FID%in%valid$FID]
fwrite(dat[STATUS=="OK",c("FID","IID")], "new.QC.valid", sep="\t") 
q() # exit R

plink1.9 --bfile new  --extract new.QC.prune.in --keep new.QC.valid --rel-cutoff 0.125 --out new.QC

plink1.9 --bfile new --make-bed --keep new.QC.rel.id --out new.QC --extract new.QC.snplist


-----------------------------------------------------------------------------

library(data.table)
dat <- fread("MD2015.QC.gz")
fwrite(dat[,BETA:=log(OR)], "new.QC.Transformed", sep="\t")
q() # exit R


plink1.9  --bfile new.QC  --clump-p1 1 --clump-r2 0.1 --clump-kb 250 --clump new.QC.Transformed --clump-snp-field SNP --clump-field P --out new

awk 'NR!=1{print $3}' new.clumped >  new.valid.snp

awk '{print $3,$8}' new.QC.Transformed > new.pvalue

echo "0.001 0 0.001" > range_list 
echo "0.05 0 0.05" >> range_list
echo "0.1 0 0.1" >> range_list
echo "0.2 0 0.2" >> range_list
echo "0.3 0 0.3" >> range_list
echo "0.4 0 0.4" >> range_list
echo "0.5 0 0.5" >> range_list


plink1.9 --bfile new.QC --score new.QC.Transformed 3 4 12 header  --q-score-range range_list new.pvalue --extract new.valid.snp --out new


# First, we need to perform prunning
plink1.9 --bfile new.QC --indep-pairwise 200 50 0.25 --out new
# Then we calculate the first 6 PCs
plink1.9 --bfile new.QC --extract new.prune.in --pca 6 --out new

library(data.table)
library(magrittr)
p.threshold <- c(0.001,0.05,0.1,0.2,0.3,0.4,0.5)
phenotype <- fread("new.phenotype")
pcs <- fread("new.eigenvec", header=F) %>%
    setnames(., colnames(.), c("FID", "IID", paste0("PC",1:6)) )
covariate <- fread("new.cov")
pheno <- merge(phenotype, covariate) %>%
        merge(., pcs)
~~~~~
null.r2 <- summary(lm(Trait1~., data=pheno[,-c("FID", "IID")]))$r.squared
prs.result <- NULL
for(i in p.threshold){
    pheno.prs <- paste0("new.", i, ".profile") %>%
        fread(.) %>%
        .[,c("FID", "IID", "SCORE")] %>%
        merge(., pheno, by=c("FID", "IID"))

    model <- lm(Trait1~., data=pheno.prs[,-c("FID","IID")]) %>%
            summary
    model.r2 <- model$r.squared
    prs.r2 <- model.r2-null.r2
    prs.coef <- model$coeff["SCORE",]
    prs.result %<>% rbind(.,
        data.frame(Threshold=i, R2=prs.r2, 
                    P=as.numeric(prs.coef[4]), 
                    BETA=as.numeric(prs.coef[1]),
                    SE=as.numeric(prs.coef[2])))
}
print(prs.result[which.max(prs.result$R2),])
q() # exit R

p.threshold <- c(0.001,0.05,0.1,0.2,0.3,0.4,0.5)

Read in the phenotype file

phenotype <- read.table("new.phenotype", header=T)

Read in the PCs

pcs <- read.table("new.eigenvec", header=F)

The default output from plink does not include a header

To make things simple, we will add the appropriate headers

(1:6 because there are 6 PCs)

colnames(pcs) <- c("FID", "IID", paste0("PC",1:6))

Read in the covariates (here, it is sex)

covariate <- read.table("new.cov", header=T)

Now merge the files

pheno <- merge(merge(phenotype, covariate, by=c("FID", "IID")), pcs, by=c("FID","IID"))

We can then calculate the null model (model with PRS) using a linear regression

(as height is quantitative)

null.model <- glm(Trait1~., data=pheno[,!colnames(pheno)%in%c("FID","IID")])

And the R2 of the null model is

null.r2 <- summary(null.model) $r.squared prs.result <- NULL for(i in p.threshold){ # Go through each p-value threshold prs <- read.table(paste0("new.",i,".profile"), header=T) # Merge the prs with the phenotype matrix # We only want the FID, IID and PRS from the PRS file, therefore we only select the # relevant columns pheno.prs <- merge(pheno, prs[,c("FID","IID", "SCORE")], by=c("FID", "IID")) # Now perform a linear regression on Height with PRS and the covariates # ignoring the FID and IID from our model model <- glm(Trait1~., data=pheno.prs[,!colnames(pheno.prs)%in%c("FID","IID")]) # model R2 is obtained as model.r2 <- summary(model)$ r.squared
# R2 of PRS is simply calculated as the model R2 minus the null R2
prs.r2 <- model.r2-null.r2
# We can also obtain the coeffcient and p-value of association of PRS as follow
prs.coef <- summary(model)$coeff["SCORE",]
prs.beta <- as.numeric(prs.coef[1])
prs.se <- as.numeric(prs.coef[2])
prs.p <- as.numeric(prs.coef[4])
# We can then store the results
prs.result <- rbind(prs.result, data.frame(Threshold=i, R2=prs.r2, P=prs.p, BETA=prs.beta,SE=prs.se))
}

Best result is:

prs.result[which.max(prs.result$R2),]
q() # exit R

最后編輯于：2022.07.27 10:58:01

?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請聯(lián)系作者

人面猴
序言：七十年代末，一起剝皮案震驚了整個濱河市系忙，隨后出現(xiàn)的幾起案子银还，更是在濱河造成了極大的恐慌蛹疯，老刑警劉巖捺弦，帶你破解...
沈念sama閱讀 216,372評論 6贊 498
死咒
序言：濱河連續(xù)發(fā)生了三起死亡事件，死亡現(xiàn)場離奇詭異苦始，居然都是意外死亡陌选，警方通過查閱死者的電腦和手機蹄溉，發(fā)現(xiàn)死者居然都...
沈念sama閱讀 92,368評論 3贊 392
救了他兩次的神仙讓他今天三更去死
文/潘曉璐我一進店門臼勉，熙熙樓的掌柜王于貴愁眉苦臉地迎上來宴霸，“玉大人瓢谢，你說我怎么就攤上這事驮瞧÷郾剩” “怎么了？”我有些...
開封第一講書人閱讀 162,415評論 0贊 353
道士緝兇錄：失蹤的賣姜人
文/不壞的土叔我叫張陵，是天一觀的道長最楷。經(jīng)常有香客問我籽孙，道長犯建，這世上最難降的妖魔是什么？我笑而不...
開封第一講書人閱讀 58,157評論 1贊 292
?港島之戀（遺憾婚禮）
正文為了忘掉前任谱仪，我火速辦了婚禮德迹，結(jié)果婚禮上揭芍，老公的妹妹穿的比我還像新娘称杨。我一直安慰自己姑原，他們只是感情好，可當我...
茶點故事閱讀 67,171評論 6贊 388
惡毒庶女頂嫁案：這布局不是一般人想出來的
文/花漫我一把揭開白布。她就那樣靜靜地躺著唤殴，像睡著了一般朵逝。火紅的嫁衣襯著肌膚如雪。梳的紋絲不亂的頭發(fā)上啤咽，一...
開封第一講書人閱讀 51,125評論 1贊 297
城市分裂傳說
那天宇整，我揣著相機與錄音没陡，去河邊找鬼盼玄。笑死埃儿，一個胖子當著我的面吹牛童番，可吹牛的內(nèi)容都是我干的。我是一名探鬼主播轨香，決...
沈念sama閱讀 40,028評論 3贊 417
雙鴛鴦連環(huán)套：你想象不到人心有多黑
文/蒼蘭香墨我猛地睜開眼臂容，長吁一口氣：“原來是場噩夢啊……” “哼脓杉！你這毒婦竟也來了球散？” 一聲冷哼從身側(cè)響起散庶，我...
開封第一講書人閱讀 38,887評論 0贊 274
萬榮殺人案實錄
序言：老撾萬榮一對情侶失蹤嘁灯，失蹤者是張志新（化名）和其女友劉穎丑婿，沒想到半個月后没卸，有當?shù)厝嗽跇淞掷锇l(fā)現(xiàn)了一具尸體约计，經(jīng)...
沈念sama閱讀 45,310評論 1贊 310
?護林員之死
正文獨居荒郊野嶺守林人離奇死亡煤蚌，尸身上長有42處帶血的膿包…… 初始之章·張勛以下內(nèi)容為張勛視角年9月15日...
茶點故事閱讀 37,533評論 2贊 332
?白月光啟示錄
正文我和宋清朗相戀三年筒占，在試婚紗的時候發(fā)現(xiàn)自己被綠了蜘犁。大學時的朋友給我發(fā)了我未婚夫和他白月光在一起吃飯的照片。...
茶點故事閱讀 39,690評論 1贊 348
活死人
序言：一個原本活蹦亂跳的男人離奇死亡导披，死狀恐怖撩匕，靈堂內(nèi)的尸體忽然破棺而出止毕，到底是詐尸還是另有隱情滓技，我是刑警寧澤棚潦，帶...
沈念sama閱讀 35,411評論 5贊 343
?日本核電站爆炸內(nèi)幕
正文年R本政府宣布丸边，位于F島的核電站妹窖，受9級特大地震影響骄呼，放射性物質(zhì)發(fā)生泄漏蜓萄。R本人自食惡果不足惜澄峰，卻給世界環(huán)境...
茶點故事閱讀 41,004評論 3贊 325
男人毒藥：我在死后第九天來索命
文/蒙蒙一绸硕、第九天我趴在偏房一處隱蔽的房頂上張望魂毁。院中可真熱鬧席楚，春花似錦、人聲如沸刁赦。這莊子的主人今日做“春日...
開封第一講書人閱讀 31,659評論 0贊 22
一樁弒父案，背后竟有這般陰謀
文/蒼蘭香墨我抬頭看了看天上的太陽狡耻。三九已至夷狰，卻和暖如春沼头，著一層夾襖步出監(jiān)牢的瞬間书劝，已是汗流浹背购对。一陣腳步聲響...
開封第一講書人閱讀 32,812評論 1贊 268
情欲美人皮
我被黑心中介騙來泰國打工骡苞，沒想到剛下飛機就差點兒被人妖公主榨干…… 1. 我叫王不留解幽，地道東北人。一個月前我還...
沈念sama閱讀 47,693評論 2贊 368
代替公主和親
正文我出身青樓蝇刀，卻偏偏與公主長得像吞琐，于是被迫代替她去往敵國和親站粟。傳聞我的和親對象是個殘疾皇子奴烙，可洞房花燭夜當晚...
茶點故事閱讀 44,577評論 2贊 353

分析流程-多基因風險分數(shù) PRS( Polygenic risk score)

1.獲取或者生成基礎(chǔ)數(shù)據(jù)（base data）

1.1 數(shù)據(jù)排序

1.2 base數(shù)據(jù)質(zhì)檢

2.目標數(shù)據(jù)（target data）的質(zhì)檢QC

2.1 創(chuàng)建分析項目

2.2 評估參照數(shù)據(jù)

2.3 評估樣品和SNP

Read in the phenotype file

Read in the PCs

The default output from plink does not include a header

To make things simple, we will add the appropriate headers

(1:6 because there are 6 PCs)

Read in the covariates (here, it is sex)

Now merge the files

We can then calculate the null model (model with PRS) using a linear regression

(as height is quantitative)

And the R2 of the null model is

Best result is:

推薦閱讀更多精彩內(nèi)容