全基因組關(guān)聯(lián)分析(GWAS)實(shí)操
數(shù)據(jù)及代碼來源:GWAS_in_R下面所用數(shù)據(jù)均可在此鏈接下載放椰,作者為githubwhussain2
1.表型數(shù)據(jù)的讀取
設(shè)置工作目錄
rm(list=ls())
setwd("G:\\GWAS\\R-for-Plant-Breeding-master\\GWAS_in_R")
讀取水稻表型數(shù)據(jù)
> rice.pheno <-read.table("./datafiles/RiceDiversity_44K_Phenotypes_34traits_PLINK.txt",header=TRUE,stringsAsFactors =FALSE,sep ="\t")
#選擇適當(dāng)?shù)牧?/p>
>rice.pheno <- rice.pheno[,c(2,3,14)]
#將 NSFTV_ 粘貼到每個基因型名稱前
>?rice.pheno$NSFTVID <- paste0("NSFTV_",rice.pheno$NSFTVID)
刪除不存在數(shù)據(jù)的項(xiàng)
> rice.pheno <- na.omit(rice.pheno)
檢查數(shù)據(jù)的維度
> dim(rice.pheno)
?讀取標(biāo)記數(shù)據(jù)
從.ped文件中讀取基因型信息
> Geno <- read_ped("./datafiles/sativas413.ped")
此時可能會報錯
Error in read_ped("./datafiles/sativas413.ped") :
could not find function "read_ped"沒有發(fā)現(xiàn)函數(shù)read_ped
安裝R包BGLR可解決問題
>install.packages('BGLR')
載入包
>library(BGLR)
>Geno<-read_ped("./datafiles/sativas413.ped")
>p=Geno$p
>n=Geno$n
>Geno=Geno$x
從.fam文件中載入基因型和編號
>FAM <- read.table("./datafiles/sativas413.fam")
從.map文件中讀取數(shù)據(jù)
>MAP <- read.table("./datafiles/sativas413.map")
在.map文件中重新編碼標(biāo)記數(shù)據(jù)
> Geno[Geno ==2]<- NA
> Geno[Geno ==3] <- 2
將標(biāo)記數(shù)據(jù)轉(zhuǎn)換為矩陣并轉(zhuǎn)置并檢查尺寸
> Geno<-matrix(Geno,nrow = p,ncol = n,byrow = TRUE)
> Geno <-t(Geno)
> dim(Geno)
將行名和列名分配給標(biāo)記文件
> colnames(Geno)<-MAP$V2
添加存儲在第二列中的行名并將 NSFTV_ID_ 粘貼到每個行名
> row.names(Geno)<-paste0("NSFTV_",FAM$V2)
2.數(shù)據(jù)過濾
2.1 標(biāo)記數(shù)據(jù)質(zhì)控
2.1.1?SNP cal rate
檢查 snps 的缺失百分比
>miss_snp<- data.frame(SNPs=apply(Geno,2,function(x){ sum(is.na(x))/nrow(Geno)*100}))
繪制缺失數(shù)據(jù)的直方圖
>hist(miss_snp$SNPs,col = "lightblue",xlab = paste("Missing Percenatge"),ylab = "Proportion of markers",main = NULL)
標(biāo)記大于 25% 的缺失數(shù)據(jù)
> mrt<-which(miss_snp>25)
從原始文件中刪除這些標(biāo)記并保存
> Geno<-Geno[,-c(mrt)]
> dim(Geno)
>missing_markers<-subset(miss_snp, SNPs>25)
2.1.2 Genotype Call Rate
檢查樣本中的缺失百分比
> miss_geno<-data.frame(Genotypes=apply(Geno, 1, name <- function(x) {sum(is.na(x))/ncol(Geno)*100 }))
查看條形圖设塔,以便更好地可視化樣本中的缺失數(shù)據(jù)
> hist(miss_geno$Genotypes,col = "lightblue",ylab = "Proportion of genotypes",xlab = "Missing Percenatge",main = NULL)
識別缺失數(shù)據(jù)超過 15% 的樣本
> gen<-which(miss_geno>15)
> Geno<-Geno[-c(gen),]
> dim(Geno)
被丟棄的樣本列表
> missing_genotypes<-subset(miss_geno,Genotypes>15)
> head(missing_genotypes)
? ? ? ? ? Genotypes
NSFTV_72? ?15.97425
NSFTV_194? 16.60705
NSFTV_252? 18.38992
NSFTV_260? 17.04452
2.1.3檢查樣本間的雜合性
獲得雜合百分比
> htero_geno<-data.frame(htr=apply(Geno,1,function(x){(sum((x==1),na.rm = TRUE)/ncol(Geno)*100)}))
使用條形圖和直方圖進(jìn)行可視化
> par(mfrow=c(1,2))
> barplot(htero_geno$htr,xlab = "Genotypes",ylab = "Heterozygosity percentage",col = "darkblue")
> hist(htero_geno$htr,xlab = "Heterozygosity percentage",ylab = "Proportion of genotypes",col = "lightblue",main= NULL)
> htrg<-which(htero_geno>4)
> Geno<-Geno[-c(htrg),]
列出被刪除的基因型列表
> htrg<-subset(htero_geno,htr>4)
> dim(Geno)
寫在最后
明天繼續(xù)看看根據(jù)代碼重現(xiàn)修赞。從前看山是山,看水是水燃乍,現(xiàn)在看山還是山,看水還是水宛琅,沒有一點(diǎn)點(diǎn)改變刻蟹,似乎隨著時間的流逝,增長的只是我的年齡嘿辟。