3K水稻SNP數(shù)據(jù)集的簡(jiǎn)單利用

3010份亞洲稻群體重測(cè)序項(xiàng)目是由中國(guó)農(nóng)業(yè)科學(xué)院作物科學(xué)研究所牽頭误褪,聯(lián)合國(guó)際水稻研究所春叫、華大基因等16家單位共同完成肩钠。該研究對(duì)來(lái)自89個(gè)國(guó)家的3,010份水稻品種進(jìn)行了重測(cè)序研究(resequence)泣港,參考的是日本晴(Nipponbare)基因組。所有3,010個(gè)基因組的平均測(cè)序深度(average sequencing depth)為14×,平均基因組覆蓋率(average genome coverage)和繪圖率(average mapping rate)分別為94.0%和92.5%价匠。
該項(xiàng)目的意義自然不用多說(shuō)当纱,并且該項(xiàng)目的測(cè)序數(shù)據(jù)都可以在國(guó)際水稻所CNGB上下載。我們可以從原始序列開始組裝得到大的結(jié)構(gòu)變異 或call到一些稀有SNP來(lái)做研究踩窖。當(dāng)然以上想法不在本文討論范圍內(nèi)坡氯,本文主要討論只想知道幾個(gè)基因區(qū)域上 SNP 及其分化系數(shù)等群體指標(biāo)的計(jì)算。

2021.09.15 更新了單倍型分型洋腮、作圖和方差分析箫柳。

放在前面

RFGB可以極大程度上滿足查找某個(gè)基因上的SNP&indel和單倍型查找。并提供15個(gè)表型可以進(jìn)行單倍型分析啥供,也可以上傳表型進(jìn)行分析悯恍。接下來(lái)探索一下。

單倍型分析

這個(gè)頁(yè)面提供單倍型分析伙狐。分別可以用基因名涮毫,染色體位置和特定的SNP來(lái)進(jìn)行分析。提供MAF是最小等位基因頻率贷屎,建議選一下罢防。可以用全部的3024個(gè)品種進(jìn)行查找也可以分亞群進(jìn)行查找,也提供了表型關(guān)聯(lián)分析唉侄。
結(jié)果

結(jié)果

這里是用的LOC_Os08g33530 CDS 區(qū)域咒吐,選的秈稻群體,共找到20個(gè)SNP属划,和16個(gè)單倍型恬叹。其中hap11的谷粒長(zhǎng)最高。
這個(gè)網(wǎng)站用來(lái)做單倍型都是SNP榴嗅,而且是復(fù)等位的妄呕,就是沒找到在哪下數(shù)據(jù)集。
可惜沒有LD的圖和pi值嗽测。
但我找到了水稻單倍型的數(shù)據(jù)绪励,具體可看文章
The landscape of gene–CDS–haplotype diversity in rice: Properties, population organization, footprints of domestication and breeding, and implications for genetic improvement
數(shù)據(jù)地址


也可以用這個(gè)單倍型數(shù)據(jù)和表型做關(guān)聯(lián)分析,用方差分析就可以唠粥。表型數(shù)據(jù)下載


以下主要是3k構(gòu)建的SNP的探索以及簡(jiǎn)單利用疏魏。

工具準(zhǔn)備

R
R package(LDheatmap,genetics)
plink1.9(3個(gè)操作系統(tǒng)版本都有)下載地址
vcftools(只有l(wèi)iunx版本,只是用來(lái)算群體遺傳統(tǒng)計(jì))下載地址

數(shù)據(jù)下載

可以在IRRI上下載到雙等位SNP&indel數(shù)據(jù)(多等位的SNP貌似只能從頭call晤愧,沒找到下載的地方)大莫。bed、bim官份、fam文件為一組只厘,bed 是存放位點(diǎn)信息的二進(jìn)制格式文件,用文本打開會(huì)發(fā)現(xiàn)是亂碼烙丛;bim是存放位點(diǎn)信息的文件,類似于map格式羔味;fam是存放樣本信息的文件河咽,第一列為FID,第二列為IID。

bim

fam

Nipponbare_indel 是雙等位indel數(shù)據(jù)集赋元,共有2.3m個(gè)忘蟹,但是是沒有經(jīng)過(guò)篩選的,后續(xù)需要加工下搁凸。
雙等位的SNP標(biāo)記共5個(gè)媚值。
NB_final_snp 包含了所有的SNP,共29m個(gè)护糖,是沒有經(jīng)過(guò)篩選的褥芒。
Base 的SNP是經(jīng)過(guò)基礎(chǔ)篩選的,共18m個(gè)椅文。
base_filtered_v0.7 是從Base的SNP進(jìn)一步篩選的喂很,共4.2m個(gè)惜颇,本文主要利用這個(gè)數(shù)據(jù)集皆刺。
pruned_v2.1 是2.1版本的篩選方案,包含1m個(gè)SNP凌摄。
base_filtered_v0.7_0.8_10kb_1_0.8_50_1 是在base_filtered_v0.7 的基礎(chǔ)上進(jìn)一步篩選 共 404k 個(gè)羡蛾,在用的時(shí)候發(fā)現(xiàn)很多基因區(qū)域上都沒有SNP,畢竟個(gè)數(shù)少锨亏,比較稀疏痴怨。
各個(gè)數(shù)據(jù)集的篩選方法可以看看里邊的readme文件,不同的篩選方案主要目的就是提高精度器予,找到因果變異浪藻。

準(zhǔn)備目標(biāo)基因的位置

其實(shí)只需要染色體號(hào)和起始位置就好了,基因名字寫啥都行乾翔,就是拿來(lái)做圖片標(biāo)題的爱葵。忘記提了,這個(gè)數(shù)據(jù)集中沒有線粒體和葉綠體上的標(biāo)記反浓。建議起始位置可以把上游和下游都囊括進(jìn)去萌丈,多個(gè)2kb,3kb 都是可以的雷则,多找?guī)讉€(gè)總是好的辆雾,萬(wàn)一和表型關(guān)聯(lián)上了呢。
基因列表月劈,以tab分割

合并想要的SNP數(shù)據(jù)集&indel數(shù)據(jù)集

其實(shí)本可以一個(gè)個(gè)數(shù)據(jù)集來(lái)度迂,但我想把2.3m個(gè)indel和4.2m個(gè)SNP標(biāo)記合在一起分析咋辦呢藤乙。最簡(jiǎn)單的辦法還是用vcftools或bcftools合并兩個(gè)數(shù)據(jù)集〔涯梗可以參考這篇湾盒。bash代碼如下:

plink --bfile base_filtered_v0.7 --recode vcf-iid --out base_filtered_v0.7 #先轉(zhuǎn)到vcf格式
plink --bfile Nipponbare_indel --recode vcf-iid --out Nipponbare_indel
###由于Nipponbare_indel 中少一個(gè)樣本 IRIS_313-8921 并且這個(gè)樣本在官網(wǎng)給的分類為NA,空值诅妹。所以就需要?jiǎng)h掉它
vcftools --vcf base_filtered_v0.7.vcf --recode --recode-INFO-all --stdout --indv IRIS_313-8921 > rm_na_base_filtered_v0.7.vcf

bcftools view Nipponbare_indel.vcf -Oz -o Nipponbare_indel.vcf.gz  #構(gòu)建引索
bcftools index Nipponbare_indel.vcf.gz 

bcftools view rm_na_base_filtered_v0.7.vcf -Oz -o rm_na_base_filtered_v0.7.vcf.gz
bcftools index rm_na_base_filtered_v0.7.vcf.gz

bcftools concat rm_na_base_filtered_v0.7.vcf.gz Nipponbare_indel.vcf.gz -a -O v -o combine_base_filtered_indel.vcf #合并兩個(gè)vcf文件

plink --vcf combine_base_filtered_indel.vcf --make-bed --out  combine_base_filtered_indel #再轉(zhuǎn)回bed格式 主要是這樣占內(nèi)存小點(diǎn)罚勾,這個(gè)vcf文件有55G ,bed文件只有5.23G

那么就不想用vcftools合并兩個(gè)數(shù)據(jù)集咋辦呢吭狡?有些時(shí)候并不想得到所有位點(diǎn)的信息尖殃,只需要一部分就可以了。其實(shí)可以用R和plink來(lái)實(shí)現(xiàn)合并兩個(gè)數(shù)據(jù)集划煮,并只提取幾個(gè)位點(diǎn)送丰,這種方案內(nèi)存代價(jià)小點(diǎn),且看后續(xù)講解弛秋。

準(zhǔn)備分類文件

群體分類文件如下
原始文件

這個(gè)分類太細(xì)了器躏,給它分成五類就可以了,GJ蟹略、XI登失、admix、aus挖炬、bas揽浙,用excel的分列就能完成,需要把IRIS_313-8921去掉意敛,這個(gè)分類是NA馅巷。當(dāng)然也可以用自定義的分類。

整合成plink能識(shí)別的分類格式草姻,分別是FID,IID,class
plink 格式

突然想起踩到的一個(gè)坑钓猬,plink1.9和2在篩選的時(shí)候無(wú)法識(shí)別樣本名稱帶有"-"字符,所以需要把 分類文件里 "-"換成"_" ,"IRIS_313-8921"換成"IRIS_313_8921"如用 excel 就可以完成撩独,對(duì)了fam文件里也要換敞曹。

目標(biāo)區(qū)域SNP位點(diǎn)提取

首先需要把plink.exe和分類文件(plink_cluster.txt)放到工作目錄下,方便調(diào)用跌榔,也可以把它放到環(huán)境目錄下异雁。

R代碼如下:

#基本參數(shù)
gene_list <- "C:\\Users\\jinghai\\Desktop\\dailywork\\9-2\\FTIP_gene_len.txt"  #目標(biāo)區(qū)間
out_name <- "C:\\Users\\jinghai\\Desktop\\dailywork\\9-2\\FTIP_gene" #輸出文件前綴
data_name <- c(".\\data\\base_filtered_v0.7",".\\data\\Nipponbare_indel") #SNP數(shù)據(jù)集,輸入多個(gè)就是合并
pop_name<-c("GJ","XI","admix","aus","bas","GJ XI","admix aus bas") #需要生成的群體
fst_pop <- c("GJ XI","admix aus bas") #計(jì)算fst需要的群體 需在pop_name出現(xiàn)過(guò)
#質(zhì)控參數(shù)
maf <- 0.05 
miss <- 0.4

color.rgb <- colorRampPalette(rev(c("white","red")),space="rgb") #color for LD

target_gene <- read.table(gene_list,header = T)
extract <- cbind(target_gene $CHROM,target_gene $start,target_gene $end,target_gene $GENE)
write.table(extract,file="temp_extract.txt",sep = "\t",append = FALSE, row.names = FALSE, col.names = FALSE, quote = FALSE)
#提取目標(biāo)區(qū)域位點(diǎn) 并輸出vcf文件 可以用Tassel 看LD
#查看樣本是否缺少
tem <- c()
for (i in 1:length(data_name)) {
    temp <- read.table(paste0(data_name[i],".fam"),header = F)
    temp <- nrow(temp)
    tem <- c(tem,temp)
}
#用少了樣本的數(shù)據(jù)集來(lái)keep
keep_file <- data_name[which(tem %in% min(tem))]

for (i in 1:length(data_name)) {
    total_cmd <- paste0(".//plink --bfile ",data_name[i] ," --extract temp_extract.txt --range --keep ",paste0(keep_file,".fam")," --recode vcf-iid --out ",paste0(out_name,"_",i))
    system(total_cmd,intern = FALSE)
}
temp_all <- read.table(paste0(out_name,"_",1,".vcf"),header = F,sep = '\t')

if (length(data_name)>=2){    #合并數(shù)據(jù)集
    for (i in 2:length(data_name)) {
        temp <- read.table(paste0(out_name,"_",i,".vcf"),header = F,sep = '\t')
        temp_all<-rbind(temp_all ,temp )
    }
    temp_all<-temp_all[order(temp_all[,1],temp_all[,2]),] #按照染色體和POS排序
}

### 列名
fam_name<-read.table(paste0(keep_file,".fam") ,header = F,sep = ' ')[,1]
vcf_colnames<-c("#CHROM",   "POS","ID","REF","ALT","QUAL","FILTER","INFO","FORMAT")
vcf_colnames<-c(vcf_colnames,fam_name)
###輸出結(jié)果,vcf
names(temp_all)<-vcf_colnames
write.table(temp_all,file=paste0(out_name,"_","target_all.vcf"),sep = "\t",append = FALSE, row.names = FALSE, col.names = T, quote = FALSE)
###篩選
total_cmd <- paste0("plink --vcf ",paste0(out_name,"_","target_all.vcf")," --const-fid --maf ",maf," --geno ",miss," --make-bed --out ",paste0(out_name,"_","target_all")) #直接用vcf篩選會(huì)有點(diǎn)問題僧须,還是再轉(zhuǎn)下bed格式
system(total_cmd,intern = FALSE)

temp <- read.table(paste0(out_name,"_","target_all",".fam"),header = F)
temp[,1] <- temp[,2]
write.table(temp,file=paste0(out_name,"_","target_all",".fam"),sep = " ",append = FALSE, row.names = FALSE, col.names = F, quote = FALSE)
total_cmd <- paste0(".//plink --bfile ",paste0(out_name,"_","target_all") ," --recode vcf-iid --out ",paste0(out_name,"_","target_all"))
system(total_cmd,intern = FALSE)

#不同群體提取 按基因提取 
for(g in target_gene$GENE){
    extract <-dplyr::filter(target_gene ,GENE==g)
    extract <- cbind(extract$CHROM,extract$start,extract$end,extract$GENE)
    write.table(extract,file="temp_extract.txt",sep = "\t",append = FALSE, row.names = FALSE, col.names = FALSE, quote = FALSE)
    
    total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all")," --extract temp_extract.txt --range --recode vcf-iid --out ",paste0(out_name,"_","target_all_",g))
    system(total_cmd,intern = FALSE)
    
    total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all")," --extract temp_extract.txt --range "," --make-bed --out ",paste0(out_name,"_","target_all_",g))
    system(total_cmd,intern = FALSE)
    
    total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all")," --extract temp_extract.txt --range --recode HV --snps-only just-acgt --out ",paste0(out_name,"_","target_all_",g))
    system(total_cmd,intern = FALSE)

    for (p in pop_name) {
        total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all_",g)," --within plink_cluster.txt  --keep-cluster-names ",p," --recode vcf-iid --out ",paste0(out_name,"_","target_all_",g,"_",gsub(" ","_",p)))
        system(total_cmd,intern = FALSE)
        total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all_",g)," --within plink_cluster.txt  --keep-cluster-names ",p," --recode HV --snps-only just-acgt --out ",paste0(out_name,"_","target_all_",g,"_",gsub(" ","_",p)))
        system(total_cmd,intern = FALSE)
###        total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all_",g)," --within plink_cluster.txt  --keep-cluster-names ",p," --make-bed --snps-only just-acgt --out ",paste0(out_name,"_","target_all_",g,"_",gsub(" ","_",p)))
###        system(total_cmd,intern = FALSE)
    }
}

for (p in pop_name) {
    total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all")," --within plink_cluster.txt  --keep-cluster-names ",p,"  --make-bed  --out ",paste0(out_name,"_","target_all_",gsub(" ","_",p)))
    system(total_cmd,intern = FALSE)
}

畫LD heatmap

畫LD用Tassel就可以話巍耗,上述代碼也生成了vcf格式文件可以直接在Tassel中打開扣甲。這里用下LDheatmap但绕。LDheatmap具體的教程比較多,自由度也比較高锭部,可自行找些教程。當(dāng)然也有在線能畫LD圖的面褐,就是有點(diǎn)麻煩拌禾。這里具體就是如何將vcf或bed格式文件轉(zhuǎn)到LDheatmap可用的格式。R代碼如下:

library(LDheatmap)
library(genetics)

plot_LD <- function(vcf_file_name,title){
    temp_vcf <- as.data.frame(data.table::fread(vcf_file_name,header = T,sep = '\t'))
    if(nrow(temp_vcf )<=1) {print(paste0(vcf_file_name," have no snp&indel"));return()}###沒有就結(jié)束

    temp_snp <- temp_vcf[0,]
    for (i in 1:nrow(temp_vcf)) {
        temp <- temp_vcf[i,]
        for(j in 10:ncol(temp_vcf)){
            if(temp_vcf[i,j]=="0/0"){
                temp[1,j] <- paste0(temp[1,4],"/",temp[1,4])
                next
            }
            if(temp_vcf[i,j]=="0/1"){
                temp[1,j] <- paste0(temp[1,4],"/",temp[1,5])
                next
            }
            if(temp_vcf[i,j]=="1/1"){
                temp[1,j] <- paste0(temp[1,5],"/",temp[1,5])
                next
            }
            if(temp_vcf[i,j]=="./."){
                temp[1,j] <- NA
                next
            }
        }
        temp_snp <- rbind(temp_snp,temp)
    }

    SNPpos <- temp_snp$POS
    SNP <- as.data.frame(t(temp_snp[,10:ncol(temp_snp)]))

    ### 轉(zhuǎn)換成LDheatmap能用的格式
    for(i in 1:ncol(SNP)){
        SNP[,i]<-as.genotype(SNP[,i])
    }
    ###畫圖
    LD <- LDheatmap(SNP, SNPpos,flip=TRUE,color=color.rgb(20),title = title,SNP.name=NULL)
}

for(g in target_gene$GENE){
    plot_LD(paste0(out_name,"_","target_all_",g,".vcf"),g)
    for (p in pop_name) {
        plot_LD(paste0(out_name,"_","target_all_",g,"_",gsub(" ","_",p),".vcf"),paste0(g,"_",gsub(" ","_",p)))
    }
}

結(jié)果

計(jì)算群體遺傳性指標(biāo)

首先推薦用vcftools來(lái)計(jì)算展哭,因?yàn)榉奖闩惹稀5碚撋现拦接胑xcel也可以算。
計(jì)算Fst,Pi,Tajimi'D based on SNP vcf file
Fst詳解(具體計(jì)算步驟)
【群體遺傳學(xué)】 π (pi)的計(jì)算

Fst

需要分成至少兩個(gè)群體 每個(gè)群體一個(gè)文件匪傍,可以放所有的5個(gè)分類群體您市,也可以放幾個(gè)感興趣的群體。
vcftools 群體文件 只要一列

bash代碼

vcftools --vcf combine_base_filtered_indel.vcf --weir-fst-pop pop_GJ.txt --weir-fst-pop pop_XI.txt --weir-fst-pop pop_admix.txt --weir-fst-pop pop_aus.txt --weir-fst-pop pop_bas.txt --out combine_base_filtered_indel.fst ###單位點(diǎn)計(jì)算

vcftools --vcf combine_base_filtered_indel.vcf --weir-fst-pop pop_GJ.txt --weir-fst-pop pop_XI.txt --weir-fst-pop pop_admix.txt --weir-fst-pop pop_aus.txt --weir-fst-pop pop_bas.txt --fst-window-size 5000--out combine_base_filtered_indel.fst ###按windows計(jì)算 單位是bp

也可以利用plink 計(jì)算單位點(diǎn)的Fst役衡。

total_pop<-length(fst_pop)

total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all")," --fst --within plink_cluster.txt","  --out ",paste0(out_name,"_","target_all"))
system(total_cmd,intern = FALSE)

###
for(p in fst_pop){
    total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all_",gsub(" ","_",p))," --fst --within plink_cluster.txt  --keep-cluster-names ",p," --out ",paste0(out_name,"_","target_all_",gsub(" ","_",p)))
    system(total_cmd,intern = FALSE)
}

temp_table<-read.table(paste0(out_name,"_","target_all",".fst"),header=T)
fst <- as.data.frame(matrix(nr=nrow(temp_table),nc=total_pop+1))
fst[,1] <- temp_table$FST

for (i in 1:total_pop) {
    temp <- read.table(paste0(out_name,"_","target_all_",gsub(" ","_",fst_pop[i]),".fst"),header=T)
    fst[,i+1] <- temp$FST
}

temp_all <- read.table(paste0(out_name,"_","target_all",".vcf"),header = F,sep="\t")
temp <- cbind(CHR=temp_all$V1,POS=temp_all$V2,fst)

names(temp) <- c("CHR","POS","fst",paste("fst",gsub(" ","_",fst_pop),sep = "_"))
write.table(temp,file=paste0(out_name,"_","target_all",".fst.txt"),sep = "\t",append = FALSE, row.names = FALSE, col.names = T, quote = FALSE) #保存結(jié)果

結(jié)果和vcftools是一樣的

pi

用vcftools計(jì)算 可以單位點(diǎn)計(jì)算 也可以按windows 計(jì)算
一般認(rèn)為按windows更合理茵休,如果研究目標(biāo)是SNP,還是單點(diǎn)算合理手蝎。
在某個(gè)或多個(gè)群體中計(jì)算就加 --keep pop_GJ.txt --keep pop_XI.txt

vcftools --vcf combine_base_filtered_indel.vcf --site-pi --out combine_base_filtered_indel.pi ###單位點(diǎn)
vcftools --vcf combine_base_filtered_indel.vcf  --window-pi-step 3000  --out combine_base_filtered_indel ###3000個(gè)snp為一個(gè)windows

雙等位的基因 pi值計(jì)算較簡(jiǎn)單榕莺,參考下上面計(jì)算fst的代碼就可以了

total_pop<-length(pop_name)

total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all")," --freq --out ",paste0(out_name,"_","target_all"))
system(total_cmd,intern = FALSE)

###
for(p in pop_name){
    total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all_",gsub(" ","_",p))," --freq --out ",paste0(out_name,"_","target_all_",gsub(" ","_",p)))
    system(total_cmd,intern = FALSE)
}

temp_table<-read.table(paste0(out_name,"_","target_all",".frq"),header=T)
pi <- as.data.frame(matrix(nr=nrow(temp_table),nc=total_pop+1))
pi[,1] <- 2*temp_table$MAF*(1-temp_table$MAF)*temp_table$NCHROBS/(temp_table$NCHROBS-1)

for (j in 1:total_pop) {
    temp <- read.table(paste0(out_name,"_","target_all_",gsub(" ","_",pop_name[j]),".frq"),header=T)
    pi[,j+1] <- 2*temp$MAF*(1-temp$MAF)*temp$NCHROBS/(temp$NCHROBS-1)
}

temp_all <- read.table(paste0(out_name,"_","target_all",".vcf"),header = F,sep="\t")
temp <- cbind(CHR=temp_all$V1,POS=temp_all$V2,pi)

names(temp) <- c("CHR","POS","pi",paste("pi",gsub(" ","_",pop_name),sep = "_"))
write.table(temp,file=paste0(out_name,"_","target_all",".pi"),sep = "\t",append = FALSE, row.names = FALSE, col.names = T, quote = FALSE) #保存結(jié)果
pi結(jié)果比對(duì)

結(jié)果基本相同。

Tajima's D 中性檢測(cè)的指標(biāo)

在某個(gè)或多個(gè)群體中計(jì)算就加 --keep pop_GJ.txt --keep pop_XI.txt

vcftools --vcf combine_base_filtered_indel.vcf --TajimaD 5000 --out combine_base_filtered_indel.TajimaD ###按5000bp計(jì)算

這個(gè)用R計(jì)算有點(diǎn)麻煩棵介,但是用Tassel計(jì)算很方便钉鸯,把vcf輸入就可以了。


Tassel

單倍型分析

這里都拿到SNP數(shù)據(jù)集了鞍时,也就可以進(jìn)行單倍型分析了亏拉,這里用的是Haploview扣蜻,要先裝java環(huán)境(但不支持1.8以上版本)逆巍,有GUI比較友好。輸入格式可以用plink格式(ped/map)莽使。
但Haploview只能做SNP的分析锐极,數(shù)量也不能太多。用前面的步驟生成了ped/info 文件芳肌,按linkage format 輸入就可以了灵再。
這里是利用了admix群體的48個(gè)SNP生成的結(jié)果。

LD

單倍型

但是我想要畫單倍型的地圖亿笤,這個(gè)軟件不太方便翎迁,最后我選擇了用beagle+dnasp+popart來(lái)做單倍型分型和畫圖。
beagle:用做單倍型推斷和填充净薛,需要java環(huán)境汪榔。
dnasp:用作單倍型分型,生成nex文件肃拜。
popart:用作單倍型畫圖痴腌,如果有地理信息雌团,也可以加上去。
由于單倍型分析不允許有缺失士聪,所以需要基因型填充锦援,可以利用上面的vcf文件作為輸入文件,輸出為vcf.gz剥悟,解壓后得到的vcf文件可以到后續(xù)利用灵寺。定相之后vcf里的"/"會(huì)變?yōu)?|"。另一個(gè)原因是3K水稻數(shù)據(jù)中并沒有說(shuō)明bed中的SNP是定相的区岗,這里為了保險(xiǎn)一點(diǎn)替久。
填充后的vcf

在powershell中輸入命令

# gt 是需要的vcf文件,按基因位置分割 out 是輸出文件前綴
java -Xss5m   -jar .\beagle.28Jun21.220.jar gt="C:\\Users\\jinghai\\Desktop\\dailywork\\9-2\\FTIP_gene_target_all.vcf" out=phased

得到定相后的vcf文件然后生成fasta文件躏尉,由于dnasp的輸入fasta文件需要等長(zhǎng),這里對(duì)有indel的地方加了"-"蚯根。每個(gè)樣本能生成兩條單鏈。
這里用R解決

phase_file <- "D:\\beagle\\phased.vcf" #vcf文件位置
out_name <- "C:\\Users\\jinghai\\Desktop\\dailywork\\9-2\\FTIP_gene" #輸出文件前綴
g <- "g" #輸出前綴

vcf <- data.table::fread(phase_file,header = T)

temp <- c()
for (i in 1:nrow(vcf)) {
    temp_value <- vcf[i,10:ncol(vcf)]
    ref <- vcf[i,4]
    alt <- vcf[i,5]
    if(nchar(alt)>nchar(ref)){
    alt <- substr(alt,2,nchar(alt))
    ref <- paste(rep("-",nchar(alt)),collapse = "")
    }
    if(nchar(ref)>nchar(alt)){
        ref <- substr(ref,2,nchar(ref))
        alt <- paste(rep("-",nchar(ref)),collapse = "")
    }
    temp_value[,which(temp_value=="0|0")] <- paste(ref,ref)
    temp_value[,which(temp_value=="0|1")] <- paste(ref,alt)
    temp_value[,which(temp_value=="1|0")] <- paste(alt,ref)
    temp_value[,which(temp_value=="1|1")] <- paste(alt,alt)
    temp <- rbind(temp,temp_value)
}
temp <- as.data.frame(temp)

cat("",file=paste0(out_name,g,".fasta"),append = FALSE)

for (i in 1:ncol(temp)) {
    c1 <- c()
    c2 <- c()
    for (j in 1:nrow(temp)) {
        c1 <- c(c1,strsplit(temp[j,i]," ")[[1]][1])
        c2 <- c(c2,strsplit(temp[j,i]," ")[[1]][2])
    }
    cat(paste(">",paste0(names(temp)[i],".1"),"\n"),file=paste0(out_name,g,".fasta"),append = TRUE)
    cat(paste0(paste(c1,collapse=""),"\n"),file=paste0(out_name,g,".fasta"),append = TRUE)
    cat(paste(">",paste0(names(temp)[i],".2"),"\n"),file=paste0(out_name,g,".fasta"),append = TRUE)
    cat(paste0(paste(c2,collapse=""),"\n"),file=paste0(out_name,g,".fasta"),append = TRUE)
}
fasta

dnasp

輸入fasta后胀糜,Generate 選Haplotype 生成單倍型颅拦,得到單倍型文件(nex)


結(jié)果

這里一共是得到315個(gè)單倍型,但大部分都只有1個(gè)樣本教藻。

由于需要統(tǒng)計(jì)樣本的信息距帅,這里把nex的freq字段另存為test.txt(),用來(lái)做統(tǒng)計(jì)括堤,并生成TraitLabels字段碌秸,這里需要樣本的分類信息plink_cluster.txt。

clust <- read.table("plink_cluster.txt",header = T,sep = "\t")
hap <- read.table("test.txt",header = F,sep = ":")
hap[,1] <- gsub("\\[","",hap[,1])
hap[,2] <- gsub("\\]","",hap[,2])

clut_hap <- clust[,2:3]

tem <- as.data.frame(strsplit(hap[,2],"    "))[2,]
hap[,3] <-t(tem)
hap[,3] <- gsub("\\.1","",hap[,3])
hap[,3] <- gsub("\\.2","",hap[,3])

for (i in 1:nrow(clust)) {
    flag <- 0
    v1 <- 0
    v2 <- 0
    for (j in 1:nrow(hap)) {
        if (flag==2) break
        if(clust[i,1] %in% strsplit(hap[j,3]," ")[[1]]){
            idx <- which(strsplit(hap[j,3]," ")[[1]] %in% clust[i,1])
            if(length(idx)==1&flag==0) {v1 <- hap[j,1];flag <- flag+1;next}
            if(length(idx)==1&flag==1) {v2 <- hap[j,1];flag <- flag+1;next}
            if(length(idx)==2&flag==0) {v1 <- hap[j,1];v2 <- hap[j,1];next}
        }
    }
    clut_hap[i,3] <- v1
    clut_hap[i,4] <- v2
}

write.table(clut_hap,file=paste0(out_name,g,"_clut_hap.txt"),sep = "\t",append = FALSE, row.names = FALSE, col.names = F, quote = FALSE) #保存結(jié)果
names(clut_hap) <- c("ID","group"," "," ")
tem <- rbind(clut_hap[,c(3,2)],clut_hap[,c(4,2)])
tem <- tem[!is.na(tem[,2]),]
temp <- table(tem, exclude = 0)

out_matrix <- data.frame(matrix(temp,nrow = length(row.names(temp))))
row.names(out_matrix) <- row.names(temp) 
names(out_matrix) <- gsub(" ","_",colnames(temp))

out_matrix$idx <- as.numeric(gsub("Hap_","",row.names(out_matrix)))
out_matrix <- out_matrix[order(out_matrix$idx),]
out_matrix <- out_matrix[,-ncol(out_matrix)]

cat("\n",file=paste0(out_name,g,".prenex"),append = FALSE)
cat("BEGIN TRAITS;\n",file=paste0(out_name,g,".prenex"),append = TRUE)
cat(paste0("Dimensions NTRAITS=",ncol(out_matrix),";\n"),file=paste0(out_name,g,".prenex"),append = TRUE)
cat(paste0("Format labels=yes missing=? separator=Tab",";\n"),file=paste0(out_name,g,".prenex"),append = TRUE)
cat(paste0("TraitLabels\t",paste(names(out_matrix),collapse = "\t"),";\n"),file=paste0(out_name,g,".prenex"),append = TRUE)
cat(paste0("Matrix","\n"),file=paste0(out_name,g,".prenex"),append = TRUE)
write.table(out_matrix,file=paste0(out_name,g,".prenex"),sep = "\t",append = TRUE,row.names = TRUE, col.names = F, quote = FALSE)
freq字段
生成文件

把生成的文件全部?jī)?nèi)容貼到dnasp生成nex文件底部悄窃。
但我在用popart的時(shí)后讥电,文件還是導(dǎo)不進(jìn)去。發(fā)現(xiàn)需要在nex文件刪掉一段轧抗。



然后就用popart可視化就好了恩敌。

315個(gè)單倍型網(wǎng)絡(luò)圖

這個(gè)太多了,手動(dòng)去了下横媚,只保留17個(gè)單倍型(修改TAXA,CHARACTERS,TRAITS三個(gè)字段)纠炮。

17個(gè)單倍型網(wǎng)絡(luò)

也可以把分類信息改成地理信息


單倍型地理分布圖

我發(fā)現(xiàn)popart計(jì)算出的中性突變系數(shù)和Tassel是一樣的。


單倍型統(tǒng)計(jì)

都有單倍型信息了也順便做下方差分析好了灯蝴。
這里用RFGB下載到的GL表型試一下恢口。
正常的關(guān)聯(lián)分析應(yīng)該用的是混合線性模型。


Statistical methods of gcHap-based GWAS

但這里只有一個(gè)基因上的單倍型信息穷躁,用整體的協(xié)方差或親緣關(guān)系矩陣會(huì)把模型變復(fù)雜耕肩,就用簡(jiǎn)單點(diǎn)的方差分析好了。(這方法有待商榷)

pheno <- read.table("grain_length.txt",header = F,sep = "\t") #表型文件
genotype <- read.table(paste0(out_name,g,"_clut_hap.txt"),header = F,sep = "\t") #單倍型文件
thr <- 50 #去掉樣本少的單倍型
tem <- merge(genotype[,c(1,4)],pheno,by=c("V1"))
colnames(tem) <- c("V1","V3","V2")
tem <- rbind(tem, merge(genotype[,c(1,3)],pheno,by=c("V1")))
tem <- merge(tem,genotype[,c(1,2)],by= c("V1"))
colnames(tem) <- c("ID","hap","pheno","cluster")
freq <- as.data.frame(table(tem$hap))
pass_filt <-  as.character(freq[freq[,2]>=thr,1])
data <- tem[which(tem$hap %in% pass_filt),]

library(car)
library(agricolae)
mod <- aov(pheno~factor(hap),data) #方差分析
Anova(mod,type=3)
oneway.test(pheno~factor(hap),data,var.equal = F)#方差不齊的方差分析
out <- scheffe.test(mod,"factor(hap)") #可以用于非平衡數(shù)據(jù)的多重比較
out$groups 
plot(out) 
方差分析結(jié)果

簡(jiǎn)單的方差分析可得hap_101的粒長(zhǎng)較長(zhǎng)。

做個(gè)總結(jié)

上述主要是提供了一個(gè)在所有7.1m的SNP數(shù)據(jù)集中看疗,提取部分位點(diǎn)的標(biāo)記沙峻,并計(jì)算LD和一些群體遺傳指標(biāo),基本上是R+plink就解決了两芳。
當(dāng)然如果手頭上有一些品種的重測(cè)序數(shù)據(jù)摔寨,也可以和3k水稻的SNP數(shù)據(jù)集放到一起進(jìn)行比對(duì),相信這個(gè)數(shù)據(jù)集還有很多信息沒有被挖掘怖辆,共勉是复。

R代碼匯總

提取位點(diǎn),群體指標(biāo)計(jì)算

#基本參數(shù)
gene_list <- "C:\\Users\\jinghai\\Desktop\\dailywork\\9-2\\FTIP_gene_len.txt"  #目標(biāo)區(qū)間
out_name <- "C:\\Users\\jinghai\\Desktop\\dailywork\\9-2\\FTIP_gene" #輸出文件前綴
data_name <- c(".\\data\\base_filtered_v0.7",".\\data\\Nipponbare_indel") #SNP數(shù)據(jù)集,輸入多個(gè)就是合并
pop_name<-c("GJ","XI","admix","aus","bas","GJ XI","admix aus bas") #需要生成的群體
fst_pop <- c("GJ XI","admix aus bas") #計(jì)算fst需要的群體 需在pop_name出現(xiàn)過(guò)
#質(zhì)控參數(shù)
maf <- 0.05 
miss <- 0.4

color.rgb <- colorRampPalette(rev(c("white","red")),space="rgb") #color for LD

target_gene <- read.table(gene_list,header = T)
extract <- cbind(target_gene $CHROM,target_gene $start,target_gene $end,target_gene $GENE)
write.table(extract,file="temp_extract.txt",sep = "\t",append = FALSE, row.names = FALSE, col.names = FALSE, quote = FALSE)

###
#提取目標(biāo)區(qū)域位點(diǎn) 并輸出vcf文件 可以用Tassel 看LD
#查看樣本是否缺少
tem <- c()
for (i in 1:length(data_name)) {
    temp <- read.table(paste0(data_name[i],".fam"),header = F)
    temp <- nrow(temp)
    tem <- c(tem,temp)
}
#用少了樣本的數(shù)據(jù)集來(lái)keep
keep_file <- data_name[which(tem %in% min(tem))]

for (i in 1:length(data_name)) {
    total_cmd <- paste0(".//plink --bfile ",data_name[i] ," --extract temp_extract.txt --range --keep ",paste0(keep_file,".fam")," --recode vcf-iid --out ",paste0(out_name,"_",i))
    system(total_cmd,intern = FALSE)
}
temp_all <- read.table(paste0(out_name,"_",1,".vcf"),header = F,sep = '\t')

if (length(data_name)>=2){    #合并數(shù)據(jù)集
    for (i in 2:length(data_name)) {
        temp <- read.table(paste0(out_name,"_",i,".vcf"),header = F,sep = '\t')
        temp_all<-rbind(temp_all ,temp )
    }
    temp_all<-temp_all[order(temp_all[,1],temp_all[,2]),] #按照染色體和POS排序
}

### 列名
fam_name<-read.table(paste0(keep_file,".fam") ,header = F,sep = ' ')[,1]
vcf_colnames<-c("#CHROM",   "POS","ID","REF","ALT","QUAL","FILTER","INFO","FORMAT")
vcf_colnames<-c(vcf_colnames,fam_name)
###輸出結(jié)果竖螃,vcf
names(temp_all)<-vcf_colnames
write.table(temp_all,file=paste0(out_name,"_","target_all.vcf"),sep = "\t",append = FALSE, row.names = FALSE, col.names = T, quote = FALSE)
###篩選
total_cmd <- paste0("plink --vcf ",paste0(out_name,"_","target_all.vcf")," --const-fid --maf ",maf," --geno ",miss," --make-bed --out ",paste0(out_name,"_","target_all")) #直接用vcf篩選會(huì)有點(diǎn)問題淑廊,還是再轉(zhuǎn)下bed格式
system(total_cmd,intern = FALSE)

temp <- read.table(paste0(out_name,"_","target_all",".fam"),header = F)
temp[,1] <- temp[,2]
write.table(temp,file=paste0(out_name,"_","target_all",".fam"),sep = " ",append = FALSE, row.names = FALSE, col.names = F, quote = FALSE)
total_cmd <- paste0(".//plink --bfile ",paste0(out_name,"_","target_all") ," --recode vcf-iid --out ",paste0(out_name,"_","target_all"))
system(total_cmd,intern = FALSE)

#不同群體提取 按基因提取 
for(g in target_gene$GENE){
    extract <-dplyr::filter(target_gene ,GENE==g)
    extract <- cbind(extract$CHROM,extract$start,extract$end,extract$GENE)
    write.table(extract,file="temp_extract.txt",sep = "\t",append = FALSE, row.names = FALSE, col.names = FALSE, quote = FALSE)
    
    total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all")," --extract temp_extract.txt --range --recode vcf-iid --out ",paste0(out_name,"_","target_all_",g))
    system(total_cmd,intern = FALSE)
    
    total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all")," --extract temp_extract.txt --range "," --recode --make-bed --out ",paste0(out_name,"_","target_all_",g))
    system(total_cmd,intern = FALSE)
    
    total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all")," --extract temp_extract.txt --range --recode HV --snps-only just-acgt --out ",paste0(out_name,"_","target_all_",g))
    system(total_cmd,intern = FALSE)

    for (p in pop_name) {
        total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all_",g)," --within plink_cluster.txt  --keep-cluster-names ",p," --recode vcf-iid --out ",paste0(out_name,"_","target_all_",g,"_",gsub(" ","_",p)))
        system(total_cmd,intern = FALSE)
        total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all_",g)," --within plink_cluster.txt  --keep-cluster-names ",p," --recode HV --snps-only just-acgt --out ",paste0(out_name,"_","target_all_",g,"_",gsub(" ","_",p)))
        system(total_cmd,intern = FALSE)
###        total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all_",g)," --within plink_cluster.txt  --keep-cluster-names ",p," --recode --make-bed --out ",paste0(out_name,"_","target_all_",g,"_",gsub(" ","_",p)))
###       system(total_cmd,intern = FALSE)
    }
}

for (p in pop_name) {
    total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all")," --within plink_cluster.txt  --keep-cluster-names ",p,"  --make-bed  --out ",paste0(out_name,"_","target_all_",gsub(" ","_",p)))
    system(total_cmd,intern = FALSE)
}
###
library(LDheatmap)
library(genetics)

plot_LD <- function(vcf_file_name,title){
    temp_vcf <- as.data.frame(data.table::fread(vcf_file_name,header = T,sep = '\t'))
    if(nrow(temp_vcf )<=1) {print(paste0(vcf_file_name," have no snp&indel"));return()}###沒有就結(jié)束

    temp_snp <- temp_vcf[0,]
    for (i in 1:nrow(temp_vcf)) {
        temp <- temp_vcf[i,]
        for(j in 10:ncol(temp_vcf)){
            if(temp_vcf[i,j]=="0/0"){
                temp[1,j] <- paste0(temp[1,4],"/",temp[1,4])
                next
            }
            if(temp_vcf[i,j]=="0/1"){
                temp[1,j] <- paste0(temp[1,4],"/",temp[1,5])
                next
            }
            if(temp_vcf[i,j]=="1/1"){
                temp[1,j] <- paste0(temp[1,5],"/",temp[1,5])
                next
            }
            if(temp_vcf[i,j]=="./."){
                temp[1,j] <- NA
                next
            }
        }
        temp_snp <- rbind(temp_snp,temp)
    }

    SNPpos <- temp_snp$POS
    SNP <- as.data.frame(t(temp_snp[,10:ncol(temp_snp)]))

    ### 轉(zhuǎn)換成LDheatmap能用的格式
    for(i in 1:ncol(SNP)){
        SNP[,i]<-as.genotype(SNP[,i])
    }
    ###畫圖
    LD <- LDheatmap(SNP, SNPpos,flip=TRUE,color=color.rgb(20),title = title,SNP.name=NULL)
}

for(g in target_gene$GENE){
    plot_LD(paste0(out_name,"_","target_all_",g,".vcf"),g)
    for (p in pop_name) {
        plot_LD(paste0(out_name,"_","target_all_",g,"_",gsub(" ","_",p),".vcf"),paste0(g,"_",gsub(" ","_",p)))
    }
}
###
total_pop<-length(fst_pop)

total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all")," --fst --within plink_cluster.txt","  --out ",paste0(out_name,"_","target_all"))
system(total_cmd,intern = FALSE)

###
for(p in fst_pop){
    total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all_",gsub(" ","_",p))," --fst --within plink_cluster.txt  --keep-cluster-names ",p," --out ",paste0(out_name,"_","target_all_",gsub(" ","_",p)))
    system(total_cmd,intern = FALSE)
}

temp_table<-read.table(paste0(out_name,"_","target_all",".fst"),header=T)
fst <- as.data.frame(matrix(nr=nrow(temp_table),nc=total_pop+1))
fst[,1] <- temp_table$FST

for (i in 1:total_pop) {
    temp <- read.table(paste0(out_name,"_","target_all_",gsub(" ","_",fst_pop[i]),".fst"),header=T)
    fst[,i+1] <- temp$FST
}

temp_all <- read.table(paste0(out_name,"_","target_all",".vcf"),header = F,sep="\t")
temp <- cbind(CHR=temp_all$V1,POS=temp_all$V2,fst)

names(temp) <- c("CHR","POS","fst",paste("fst",gsub(" ","_",fst_pop),sep = "_"))
write.table(temp,file=paste0(out_name,"_","target_all",".fst.txt"),sep = "\t",append = FALSE, row.names = FALSE, col.names = T, quote = FALSE) #保存結(jié)果

###
total_pop<-length(pop_name)

total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all")," --freq --out ",paste0(out_name,"_","target_all"))
system(total_cmd,intern = FALSE)

###
for(p in pop_name){
    total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all_",gsub(" ","_",p))," --freq --out ",paste0(out_name,"_","target_all_",gsub(" ","_",p)))
    system(total_cmd,intern = FALSE)
}

temp_table<-read.table(paste0(out_name,"_","target_all",".frq"),header=T)
pi <- as.data.frame(matrix(nr=nrow(temp_table),nc=total_pop+1))
pi[,1] <- 2*temp_table$MAF*(1-temp_table$MAF)*temp_table$NCHROBS/(temp_table$NCHROBS-1)

for (j in 1:total_pop) {
    temp <- read.table(paste0(out_name,"_","target_all_",gsub(" ","_",pop_name[j]),".frq"),header=T)
    pi[,j+1] <- 2*temp$MAF*(1-temp$MAF)*temp$NCHROBS/(temp$NCHROBS-1)
}

temp_all <- read.table(paste0(out_name,"_","target_all",".vcf"),header = F,sep="\t")
temp <- cbind(CHR=temp_all$V1,POS=temp_all$V2,pi)

names(temp) <- c("CHR","POS","pi",paste("pi",gsub(" ","_",pop_name),sep = "_"))
write.table(temp,file=paste0(out_name,"_","target_all",".pi"),sep = "\t",append = FALSE, row.names = FALSE, col.names = T, quote = FALSE) #保存結(jié)果
###

vcf to fasta

phase_file <- "D:\\beagle\\phased.vcf.vcf"
g <- g

vcf <- data.table::fread(phase_file,header = T)

temp <- c()
for (i in 1:nrow(vcf)) {
    temp_value <- vcf[i,10:ncol(vcf)]
    ref <- vcf[i,4]
    alt <- vcf[i,5]
    if(nchar(alt)>nchar(ref)){
    alt <- substr(alt,2,nchar(alt))
    ref <- paste(rep("-",nchar(alt)),collapse = "")
    }
    if(nchar(ref)>nchar(alt)){
        ref <- substr(ref,2,nchar(ref))
        alt <- paste(rep("-",nchar(ref)),collapse = "")
    }
    temp_value[,which(temp_value=="0|0")] <- paste(ref,ref)
    temp_value[,which(temp_value=="0|1")] <- paste(ref,alt)
    temp_value[,which(temp_value=="1|0")] <- paste(alt,ref)
    temp_value[,which(temp_value=="1|1")] <- paste(alt,alt)
    temp <- rbind(temp,temp_value)
}
temp <- as.data.frame(temp)

cat("",file=paste0(out_name,g,".fasta"),append = FALSE)

for (i in 1:ncol(temp)) {
    c1 <- c()
    c2 <- c()
    for (j in 1:nrow(temp)) {
        c1 <- c(c1,strsplit(temp[j,i]," ")[[1]][1])
        c2 <- c(c2,strsplit(temp[j,i]," ")[[1]][2])
    }
    cat(paste(">",paste0(names(temp)[i],".1"),"\n"),file=paste0(out_name,g,".fasta"),append = TRUE)
    cat(paste0(paste(c1,collapse=""),"\n"),file=paste0(out_name,g,".fasta"),append = TRUE)
    cat(paste(">",paste0(names(temp)[i],".2"),"\n"),file=paste0(out_name,g,".fasta"),append = TRUE)
    cat(paste0(paste(c2,collapse=""),"\n"),file=paste0(out_name,g,".fasta"),append = TRUE)
}

nex traits 文件生成

clust <- read.table("plink_cluster.txt",header = T,sep = "\t")
hap <- read.table("test.txt",header = F,sep = ":")
hap[,1] <- gsub("\\[","",hap[,1])
hap[,2] <- gsub("\\]","",hap[,2])

clut_hap <- clust[,2:3]

tem <- as.data.frame(strsplit(hap[,2],"    "))[2,]
hap[,3] <-t(tem)
hap[,3] <- gsub("\\.1","",hap[,3])
hap[,3] <- gsub("\\.2","",hap[,3])

for (i in 1:nrow(clust)) {
    flag <- 0
    v1 <- 0
    v2 <- 0
    for (j in 1:nrow(hap)) {
        if (flag==2) break
        if(clust[i,1] %in% strsplit(hap[j,3]," ")[[1]]){
            idx <- which(strsplit(hap[j,3]," ")[[1]] %in% clust[i,1])
            if(length(idx)==1&flag==0) {v1 <- hap[j,1];flag <- flag+1;next}
            if(length(idx)==1&flag==1) {v2 <- hap[j,1];flag <- flag+1;next}
            if(length(idx)==2&flag==0) {v1 <- hap[j,1];v2 <- hap[j,1];next}
        }
    }
    clut_hap[i,3] <- v1
    clut_hap[i,4] <- v2
}

write.table(clut_hap,file=paste0(out_name,g,"_clut_hap.txt"),sep = "\t",append = FALSE, row.names = FALSE, col.names = F, quote = FALSE) #保存結(jié)果
names(clut_hap) <- c("ID","group"," "," ")
tem <- rbind(clut_hap[,c(3,2)],clut_hap[,c(4,2)])
tem <- tem[!is.na(tem[,2]),]
temp <- table(tem, exclude = 0)

out_matrix <- data.frame(matrix(temp,nrow = length(row.names(temp))))
row.names(out_matrix) <- row.names(temp) 
names(out_matrix) <- gsub(" ","_",colnames(temp))

out_matrix$idx <- as.numeric(gsub("Hap_","",row.names(out_matrix)))
out_matrix <- out_matrix[order(out_matrix$idx),]
out_matrix <- out_matrix[,-ncol(out_matrix)]

cat("\n",file=paste0(out_name,g,".prenex"),append = FALSE)
cat("BEGIN TRAITS;\n",file=paste0(out_name,g,".prenex"),append = TRUE)
cat(paste0("Dimensions NTRAITS=",ncol(out_matrix),";\n"),file=paste0(out_name,g,".prenex"),append = TRUE)
cat(paste0("Format labels=yes missing=? separator=Tab",";\n"),file=paste0(out_name,g,".prenex"),append = TRUE)
cat(paste0("TraitLabels\t",paste(names(out_matrix),collapse = "\t"),";\n"),file=paste0(out_name,g,".prenex"),append = TRUE)
cat(paste0("Matrix","\n"),file=paste0(out_name,g,".prenex"),append = TRUE)
write.table(out_matrix,file=paste0(out_name,g,".prenex"),sep = "\t",append = TRUE,row.names = TRUE, col.names = F, quote = FALSE)

方差分析

pheno <- read.table("grain_length.txt",header = F,sep = "\t")
genotype <- read.table(paste0(out_name,g,"_clut_hap.txt"),header = F,sep = "\t")
thr <- 50
tem <- merge(genotype[,c(1,4)],pheno,by=c("V1"))
colnames(tem) <- c("V1","V3","V2")
tem <- rbind(tem, merge(genotype[,c(1,3)],pheno,by=c("V1")))
tem <- merge(tem,genotype[,c(1,2)],by= c("V1"))
colnames(tem) <- c("ID","hap","pheno","cluster")
freq <- as.data.frame(table(tem$hap))
pass_filt <-  as.character(freq[freq[,2]>=thr,1])
data <- tem[which(tem$hap %in% pass_filt),]

library(car)
library(agricolae)
mod <- aov(pheno~factor(hap),data)
Anova(mod,type=3)
oneway.test(pheno~factor(hap),data,var.equal = F)
out <- scheffe.test(mod,"factor(hap)") #可以用于非平衡數(shù)據(jù)的多重比較
out$groups
plot(out) 

吐槽一下,簡(jiǎn)書的這個(gè)界面太不好寫了特咆,也不好看季惩。代碼中難免會(huì)有多個(gè)空格,逗號(hào)不對(duì)啥的腻格,請(qǐng)見諒画拾。

最后編輯于
?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請(qǐng)聯(lián)系作者
  • 序言:七十年代末,一起剝皮案震驚了整個(gè)濱河市菜职,隨后出現(xiàn)的幾起案子青抛,更是在濱河造成了極大的恐慌,老刑警劉巖酬核,帶你破解...
    沈念sama閱讀 216,324評(píng)論 6 498
  • 序言:濱河連續(xù)發(fā)生了三起死亡事件蜜另,死亡現(xiàn)場(chǎng)離奇詭異,居然都是意外死亡嫡意,警方通過(guò)查閱死者的電腦和手機(jī)举瑰,發(fā)現(xiàn)死者居然都...
    沈念sama閱讀 92,356評(píng)論 3 392
  • 文/潘曉璐 我一進(jìn)店門,熙熙樓的掌柜王于貴愁眉苦臉地迎上來(lái)鹅很,“玉大人嘶居,你說(shuō)我怎么就攤上這事〈僦螅” “怎么了?”我有些...
    開封第一講書人閱讀 162,328評(píng)論 0 353
  • 文/不壞的土叔 我叫張陵整袁,是天一觀的道長(zhǎng)菠齿。 經(jīng)常有香客問我,道長(zhǎng)坐昙,這世上最難降的妖魔是什么绳匀? 我笑而不...
    開封第一講書人閱讀 58,147評(píng)論 1 292
  • 正文 為了忘掉前任,我火速辦了婚禮,結(jié)果婚禮上疾棵,老公的妹妹穿的比我還像新娘戈钢。我一直安慰自己,他們只是感情好是尔,可當(dāng)我...
    茶點(diǎn)故事閱讀 67,160評(píng)論 6 388
  • 文/花漫 我一把揭開白布殉了。 她就那樣靜靜地躺著,像睡著了一般拟枚。 火紅的嫁衣襯著肌膚如雪薪铜。 梳的紋絲不亂的頭發(fā)上,一...
    開封第一講書人閱讀 51,115評(píng)論 1 296
  • 那天恩溅,我揣著相機(jī)與錄音隔箍,去河邊找鬼。 笑死脚乡,一個(gè)胖子當(dāng)著我的面吹牛蜒滩,可吹牛的內(nèi)容都是我干的。 我是一名探鬼主播奶稠,決...
    沈念sama閱讀 40,025評(píng)論 3 417
  • 文/蒼蘭香墨 我猛地睜開眼帮掉,長(zhǎng)吁一口氣:“原來(lái)是場(chǎng)噩夢(mèng)啊……” “哼!你這毒婦竟也來(lái)了窒典?” 一聲冷哼從身側(cè)響起蟆炊,我...
    開封第一講書人閱讀 38,867評(píng)論 0 274
  • 序言:老撾萬(wàn)榮一對(duì)情侶失蹤,失蹤者是張志新(化名)和其女友劉穎瀑志,沒想到半個(gè)月后涩搓,有當(dāng)?shù)厝嗽跇淞掷锇l(fā)現(xiàn)了一具尸體,經(jīng)...
    沈念sama閱讀 45,307評(píng)論 1 310
  • 正文 獨(dú)居荒郊野嶺守林人離奇死亡劈猪,尸身上長(zhǎng)有42處帶血的膿包…… 初始之章·張勛 以下內(nèi)容為張勛視角 年9月15日...
    茶點(diǎn)故事閱讀 37,528評(píng)論 2 332
  • 正文 我和宋清朗相戀三年昧甘,在試婚紗的時(shí)候發(fā)現(xiàn)自己被綠了。 大學(xué)時(shí)的朋友給我發(fā)了我未婚夫和他白月光在一起吃飯的照片战得。...
    茶點(diǎn)故事閱讀 39,688評(píng)論 1 348
  • 序言:一個(gè)原本活蹦亂跳的男人離奇死亡充边,死狀恐怖,靈堂內(nèi)的尸體忽然破棺而出常侦,到底是詐尸還是另有隱情浇冰,我是刑警寧澤,帶...
    沈念sama閱讀 35,409評(píng)論 5 343
  • 正文 年R本政府宣布聋亡,位于F島的核電站肘习,受9級(jí)特大地震影響,放射性物質(zhì)發(fā)生泄漏坡倔。R本人自食惡果不足惜漂佩,卻給世界環(huán)境...
    茶點(diǎn)故事閱讀 41,001評(píng)論 3 325
  • 文/蒙蒙 一脖含、第九天 我趴在偏房一處隱蔽的房頂上張望。 院中可真熱鬧投蝉,春花似錦养葵、人聲如沸。這莊子的主人今日做“春日...
    開封第一講書人閱讀 31,657評(píng)論 0 22
  • 文/蒼蘭香墨 我抬頭看了看天上的太陽(yáng)。三九已至咳榜,卻和暖如春夏醉,著一層夾襖步出監(jiān)牢的瞬間,已是汗流浹背涌韩。 一陣腳步聲響...
    開封第一講書人閱讀 32,811評(píng)論 1 268
  • 我被黑心中介騙來(lái)泰國(guó)打工畔柔, 沒想到剛下飛機(jī)就差點(diǎn)兒被人妖公主榨干…… 1. 我叫王不留,地道東北人臣樱。 一個(gè)月前我還...
    沈念sama閱讀 47,685評(píng)論 2 368
  • 正文 我出身青樓靶擦,卻偏偏與公主長(zhǎng)得像,于是被迫代替她去往敵國(guó)和親雇毫。 傳聞我的和親對(duì)象是個(gè)殘疾皇子玄捕,可洞房花燭夜當(dāng)晚...
    茶點(diǎn)故事閱讀 44,573評(píng)論 2 353

推薦閱讀更多精彩內(nèi)容