3K水稻SNP數(shù)據(jù)集的簡(jiǎn)單利用

3010份亞洲稻群體重測(cè)序項(xiàng)目是由中國(guó)農(nóng)業(yè)科學(xué)院作物科學(xué)研究所牽頭误褪，聯(lián)合國(guó)際水稻研究所春叫、華大基因等16家單位共同完成肩钠。該研究對(duì)來(lái)自89個(gè)國(guó)家的3,010份水稻品種進(jìn)行了重測(cè)序研究（resequence）泣港，參考的是日本晴（Nipponbare）基因組。所有3,010個(gè)基因組的平均測(cè)序深度（average sequencing depth）為14×,平均基因組覆蓋率（average genome coverage）和繪圖率（average mapping rate）分別為94.0％和92.5％价匠。
該項(xiàng)目的意義自然不用多說(shuō)当纱，并且該項(xiàng)目的測(cè)序數(shù)據(jù)都可以在國(guó)際水稻所或CNGB上下載。我們可以從原始序列開始組裝得到大的結(jié)構(gòu)變異或call到一些稀有SNP來(lái)做研究踩窖。當(dāng)然以上想法不在本文討論范圍內(nèi)坡氯，本文主要討論只想知道幾個(gè)基因區(qū)域上 SNP 及其分化系數(shù)等群體指標(biāo)的計(jì)算。

2021.09.15 更新了單倍型分型洋腮、作圖和方差分析箫柳。

放在前面

RFGB可以極大程度上滿足查找某個(gè)基因上的SNP&indel和單倍型查找。并提供15個(gè)表型可以進(jìn)行單倍型分析啥供，也可以上傳表型進(jìn)行分析悯恍。接下來(lái)探索一下。

單倍型分析

這個(gè)頁(yè)面提供單倍型分析伙狐。分別可以用基因名涮毫，染色體位置和特定的SNP來(lái)進(jìn)行分析。提供MAF是最小等位基因頻率贷屎，建議選一下罢防。可以用全部的3024個(gè)品種進(jìn)行查找也可以分亞群進(jìn)行查找,也提供了表型關(guān)聯(lián)分析唉侄。

結(jié)果

這里是用的LOC_Os08g33530 CDS 區(qū)域咒吐，選的秈稻群體，共找到20個(gè)SNP属划，和16個(gè)單倍型恬叹。其中hap11的谷粒長(zhǎng)最高。
這個(gè)網(wǎng)站用來(lái)做單倍型都是SNP榴嗅，而且是復(fù)等位的妄呕，就是沒找到在哪下數(shù)據(jù)集。
可惜沒有LD的圖和pi值嗽测。
但我找到了水稻單倍型的數(shù)據(jù)绪励，具體可看文章
The landscape of gene–CDS–haplotype diversity in rice: Properties, population organization, footprints of domestication and breeding, and implications for genetic improvement
數(shù)據(jù)地址

也可以用這個(gè)單倍型數(shù)據(jù)和表型做關(guān)聯(lián)分析，用方差分析就可以唠粥。表型數(shù)據(jù)下載

以下主要是3k構(gòu)建的SNP的探索以及簡(jiǎn)單利用疏魏。

工具準(zhǔn)備

R
R package(LDheatmap,genetics)
plink1.9(3個(gè)操作系統(tǒng)版本都有)下載地址
vcftools(只有l(wèi)iunx版本，只是用來(lái)算群體遺傳統(tǒng)計(jì))下載地址

數(shù)據(jù)下載

可以在IRRI上下載到雙等位SNP&indel數(shù)據(jù)（多等位的SNP貌似只能從頭call晤愧，沒找到下載的地方)大莫。bed、bim官份、fam文件為一組只厘，bed 是存放位點(diǎn)信息的二進(jìn)制格式文件,用文本打開會(huì)發(fā)現(xiàn)是亂碼烙丛；bim是存放位點(diǎn)信息的文件，類似于map格式羔味；fam是存放樣本信息的文件河咽，第一列為FID,第二列為IID。

bim

fam

Nipponbare_indel 是雙等位indel數(shù)據(jù)集赋元，共有2.3m個(gè)忘蟹，但是是沒有經(jīng)過(guò)篩選的，后續(xù)需要加工下搁凸。
雙等位的SNP標(biāo)記共5個(gè)媚值。
NB_final_snp 包含了所有的SNP，共29m個(gè)护糖，是沒有經(jīng)過(guò)篩選的褥芒。
Base 的SNP是經(jīng)過(guò)基礎(chǔ)篩選的，共18m個(gè)椅文。
base_filtered_v0.7 是從Base的SNP進(jìn)一步篩選的喂很，共4.2m個(gè)惜颇，本文主要利用這個(gè)數(shù)據(jù)集皆刺。
pruned_v2.1 是2.1版本的篩選方案，包含1m個(gè)SNP凌摄。
base_filtered_v0.7_0.8_10kb_1_0.8_50_1 是在base_filtered_v0.7 的基礎(chǔ)上進(jìn)一步篩選共 404k 個(gè)羡蛾，在用的時(shí)候發(fā)現(xiàn)很多基因區(qū)域上都沒有SNP，畢竟個(gè)數(shù)少锨亏，比較稀疏痴怨。
各個(gè)數(shù)據(jù)集的篩選方法可以看看里邊的readme文件，不同的篩選方案主要目的就是提高精度器予，找到因果變異浪藻。

準(zhǔn)備目標(biāo)基因的位置

其實(shí)只需要染色體號(hào)和起始位置就好了，基因名字寫啥都行乾翔，就是拿來(lái)做圖片標(biāo)題的爱葵。忘記提了，這個(gè)數(shù)據(jù)集中沒有線粒體和葉綠體上的標(biāo)記反浓。建議起始位置可以把上游和下游都囊括進(jìn)去萌丈，多個(gè)2kb，3kb 都是可以的雷则，多找?guī)讉€(gè)總是好的辆雾，萬(wàn)一和表型關(guān)聯(lián)上了呢。

基因列表月劈，以tab分割

合并想要的SNP數(shù)據(jù)集&indel數(shù)據(jù)集

其實(shí)本可以一個(gè)個(gè)數(shù)據(jù)集來(lái)度迂，但我想把2.3m個(gè)indel和4.2m個(gè)SNP標(biāo)記合在一起分析咋辦呢藤乙。最簡(jiǎn)單的辦法還是用vcftools或bcftools合并兩個(gè)數(shù)據(jù)集〔涯梗可以參考這篇湾盒。bash代碼如下：

plink --bfile base_filtered_v0.7 --recode vcf-iid --out base_filtered_v0.7 #先轉(zhuǎn)到vcf格式
plink --bfile Nipponbare_indel --recode vcf-iid --out Nipponbare_indel
###由于Nipponbare_indel 中少一個(gè)樣本 IRIS_313-8921 并且這個(gè)樣本在官網(wǎng)給的分類為NA，空值诅妹。所以就需要?jiǎng)h掉它
vcftools --vcf base_filtered_v0.7.vcf --recode --recode-INFO-all --stdout --indv IRIS_313-8921 > rm_na_base_filtered_v0.7.vcf

bcftools view Nipponbare_indel.vcf -Oz -o Nipponbare_indel.vcf.gz  #構(gòu)建引索
bcftools index Nipponbare_indel.vcf.gz 

bcftools view rm_na_base_filtered_v0.7.vcf -Oz -o rm_na_base_filtered_v0.7.vcf.gz
bcftools index rm_na_base_filtered_v0.7.vcf.gz

bcftools concat rm_na_base_filtered_v0.7.vcf.gz Nipponbare_indel.vcf.gz -a -O v -o combine_base_filtered_indel.vcf #合并兩個(gè)vcf文件

plink --vcf combine_base_filtered_indel.vcf --make-bed --out  combine_base_filtered_indel #再轉(zhuǎn)回bed格式 主要是這樣占內(nèi)存小點(diǎn)罚勾，這個(gè)vcf文件有55G ，bed文件只有5.23G

那么就不想用vcftools合并兩個(gè)數(shù)據(jù)集咋辦呢吭狡？有些時(shí)候并不想得到所有位點(diǎn)的信息尖殃，只需要一部分就可以了。其實(shí)可以用R和plink來(lái)實(shí)現(xiàn)合并兩個(gè)數(shù)據(jù)集划煮，并只提取幾個(gè)位點(diǎn)送丰，這種方案內(nèi)存代價(jià)小點(diǎn)，且看后續(xù)講解弛秋。

準(zhǔn)備分類文件

群體分類文件如下

原始文件

這個(gè)分類太細(xì)了器躏，給它分成五類就可以了，GJ蟹略、XI登失、admix、aus挖炬、bas揽浙，用excel的分列就能完成，需要把IRIS_313-8921去掉意敛，這個(gè)分類是NA馅巷。當(dāng)然也可以用自定義的分類。

整合成plink能識(shí)別的分類格式草姻，分別是FID,IID,class

plink 格式

突然想起踩到的一個(gè)坑钓猬，plink1.9和2在篩選的時(shí)候無(wú)法識(shí)別樣本名稱帶有"-"字符，所以需要把分類文件里 "-"換成"_" ,"IRIS_313-8921"換成"IRIS_313_8921"如用 excel 就可以完成撩独，對(duì)了fam文件里也要換敞曹。

目標(biāo)區(qū)域SNP位點(diǎn)提取

首先需要把plink.exe和分類文件(plink_cluster.txt)放到工作目錄下，方便調(diào)用跌榔，也可以把它放到環(huán)境目錄下异雁。

R代碼如下：

#基本參數(shù)
gene_list <- "C:\\Users\\jinghai\\Desktop\\dailywork\\9-2\\FTIP_gene_len.txt"  #目標(biāo)區(qū)間
out_name <- "C:\\Users\\jinghai\\Desktop\\dailywork\\9-2\\FTIP_gene" #輸出文件前綴
data_name <- c(".\\data\\base_filtered_v0.7",".\\data\\Nipponbare_indel") #SNP數(shù)據(jù)集,輸入多個(gè)就是合并
pop_name<-c("GJ","XI","admix","aus","bas","GJ XI","admix aus bas") #需要生成的群體
fst_pop <- c("GJ XI","admix aus bas") #計(jì)算fst需要的群體 需在pop_name出現(xiàn)過(guò)
#質(zhì)控參數(shù)
maf <- 0.05 
miss <- 0.4

color.rgb <- colorRampPalette(rev(c("white","red")),space="rgb") #color for LD

target_gene <- read.table(gene_list,header = T)
extract <- cbind(target_gene $CHROM,target_gene $start,target_gene $end,target_gene $GENE)
write.table(extract,file="temp_extract.txt",sep = "\t",append = FALSE, row.names = FALSE, col.names = FALSE, quote = FALSE)
#提取目標(biāo)區(qū)域位點(diǎn) 并輸出vcf文件 可以用Tassel 看LD
#查看樣本是否缺少
tem <- c()
for (i in 1:length(data_name)) {
    temp <- read.table(paste0(data_name[i],".fam"),header = F)
    temp <- nrow(temp)
    tem <- c(tem,temp)
}
#用少了樣本的數(shù)據(jù)集來(lái)keep
keep_file <- data_name[which(tem %in% min(tem))]

for (i in 1:length(data_name)) {
    total_cmd <- paste0(".//plink --bfile ",data_name[i] ," --extract temp_extract.txt --range --keep ",paste0(keep_file,".fam")," --recode vcf-iid --out ",paste0(out_name,"_",i))
    system(total_cmd,intern = FALSE)
}
temp_all <- read.table(paste0(out_name,"_",1,".vcf"),header = F,sep = '\t')

if (length(data_name)>=2){    #合并數(shù)據(jù)集
    for (i in 2:length(data_name)) {
        temp <- read.table(paste0(out_name,"_",i,".vcf"),header = F,sep = '\t')
        temp_all<-rbind(temp_all ,temp )
    }
    temp_all<-temp_all[order(temp_all[,1],temp_all[,2]),] #按照染色體和POS排序
}

### 列名
fam_name<-read.table(paste0(keep_file,".fam") ,header = F,sep = ' ')[,1]
vcf_colnames<-c("#CHROM",   "POS","ID","REF","ALT","QUAL","FILTER","INFO","FORMAT")
vcf_colnames<-c(vcf_colnames,fam_name)
###輸出結(jié)果，vcf
names(temp_all)<-vcf_colnames
write.table(temp_all,file=paste0(out_name,"_","target_all.vcf"),sep = "\t",append = FALSE, row.names = FALSE, col.names = T, quote = FALSE)
###篩選
total_cmd <- paste0("plink --vcf ",paste0(out_name,"_","target_all.vcf")," --const-fid --maf ",maf," --geno ",miss," --make-bed --out ",paste0(out_name,"_","target_all")) #直接用vcf篩選會(huì)有點(diǎn)問題僧须，還是再轉(zhuǎn)下bed格式
system(total_cmd,intern = FALSE)

temp <- read.table(paste0(out_name,"_","target_all",".fam"),header = F)
temp[,1] <- temp[,2]
write.table(temp,file=paste0(out_name,"_","target_all",".fam"),sep = " ",append = FALSE, row.names = FALSE, col.names = F, quote = FALSE)
total_cmd <- paste0(".//plink --bfile ",paste0(out_name,"_","target_all") ," --recode vcf-iid --out ",paste0(out_name,"_","target_all"))
system(total_cmd,intern = FALSE)

#不同群體提取 按基因提取 
for(g in target_gene$GENE){
    extract <-dplyr::filter(target_gene ,GENE==g)
    extract <- cbind(extract$CHROM,extract$start,extract$end,extract$GENE)
    write.table(extract,file="temp_extract.txt",sep = "\t",append = FALSE, row.names = FALSE, col.names = FALSE, quote = FALSE)
    
    total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all")," --extract temp_extract.txt --range --recode vcf-iid --out ",paste0(out_name,"_","target_all_",g))
    system(total_cmd,intern = FALSE)
    
    total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all")," --extract temp_extract.txt --range "," --make-bed --out ",paste0(out_name,"_","target_all_",g))
    system(total_cmd,intern = FALSE)
    
    total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all")," --extract temp_extract.txt --range --recode HV --snps-only just-acgt --out ",paste0(out_name,"_","target_all_",g))
    system(total_cmd,intern = FALSE)

    for (p in pop_name) {
        total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all_",g)," --within plink_cluster.txt  --keep-cluster-names ",p," --recode vcf-iid --out ",paste0(out_name,"_","target_all_",g,"_",gsub(" ","_",p)))
        system(total_cmd,intern = FALSE)
        total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all_",g)," --within plink_cluster.txt  --keep-cluster-names ",p," --recode HV --snps-only just-acgt --out ",paste0(out_name,"_","target_all_",g,"_",gsub(" ","_",p)))
        system(total_cmd,intern = FALSE)
###        total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all_",g)," --within plink_cluster.txt  --keep-cluster-names ",p," --make-bed --snps-only just-acgt --out ",paste0(out_name,"_","target_all_",g,"_",gsub(" ","_",p)))
###        system(total_cmd,intern = FALSE)
    }
}

for (p in pop_name) {
    total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all")," --within plink_cluster.txt  --keep-cluster-names ",p,"  --make-bed  --out ",paste0(out_name,"_","target_all_",gsub(" ","_",p)))
    system(total_cmd,intern = FALSE)
}

畫LD heatmap

畫LD用Tassel就可以話巍耗，上述代碼也生成了vcf格式文件可以直接在Tassel中打開扣甲。這里用下LDheatmap但绕。LDheatmap具體的教程比較多，自由度也比較高锭部，可自行找些教程。當(dāng)然也有在線能畫LD圖的面褐，就是有點(diǎn)麻煩拌禾。這里具體就是如何將vcf或bed格式文件轉(zhuǎn)到LDheatmap可用的格式。R代碼如下：

library(LDheatmap)
library(genetics)

plot_LD <- function(vcf_file_name,title){
    temp_vcf <- as.data.frame(data.table::fread(vcf_file_name,header = T,sep = '\t'))
    if(nrow(temp_vcf )<=1) {print(paste0(vcf_file_name," have no snp&indel"));return()}###沒有就結(jié)束

    temp_snp <- temp_vcf[0,]
    for (i in 1:nrow(temp_vcf)) {
        temp <- temp_vcf[i,]
        for(j in 10:ncol(temp_vcf)){
            if(temp_vcf[i,j]=="0/0"){
                temp[1,j] <- paste0(temp[1,4],"/",temp[1,4])
                next
            }
            if(temp_vcf[i,j]=="0/1"){
                temp[1,j] <- paste0(temp[1,4],"/",temp[1,5])
                next
            }
            if(temp_vcf[i,j]=="1/1"){
                temp[1,j] <- paste0(temp[1,5],"/",temp[1,5])
                next
            }
            if(temp_vcf[i,j]=="./."){
                temp[1,j] <- NA
                next
            }
        }
        temp_snp <- rbind(temp_snp,temp)
    }

    SNPpos <- temp_snp$POS
    SNP <- as.data.frame(t(temp_snp[,10:ncol(temp_snp)]))

    ### 轉(zhuǎn)換成LDheatmap能用的格式
    for(i in 1:ncol(SNP)){
        SNP[,i]<-as.genotype(SNP[,i])
    }
    ###畫圖
    LD <- LDheatmap(SNP, SNPpos,flip=TRUE,color=color.rgb(20),title = title,SNP.name=NULL)
}

for(g in target_gene$GENE){
    plot_LD(paste0(out_name,"_","target_all_",g,".vcf"),g)
    for (p in pop_name) {
        plot_LD(paste0(out_name,"_","target_all_",g,"_",gsub(" ","_",p),".vcf"),paste0(g,"_",gsub(" ","_",p)))
    }
}

結(jié)果

計(jì)算群體遺傳性指標(biāo)

首先推薦用vcftools來(lái)計(jì)算展哭，因?yàn)榉奖闩惹稀５碚撋现拦接胑xcel也可以算。
計(jì)算Fst,Pi,Tajimi'D based on SNP vcf file
Fst詳解（具體計(jì)算步驟）
【群體遺傳學(xué)】 π （pi）的計(jì)算

Fst

需要分成至少兩個(gè)群體每個(gè)群體一個(gè)文件匪傍，可以放所有的5個(gè)分類群體您市，也可以放幾個(gè)感興趣的群體。

vcftools 群體文件只要一列

bash代碼

vcftools --vcf combine_base_filtered_indel.vcf --weir-fst-pop pop_GJ.txt --weir-fst-pop pop_XI.txt --weir-fst-pop pop_admix.txt --weir-fst-pop pop_aus.txt --weir-fst-pop pop_bas.txt --out combine_base_filtered_indel.fst ###單位點(diǎn)計(jì)算

vcftools --vcf combine_base_filtered_indel.vcf --weir-fst-pop pop_GJ.txt --weir-fst-pop pop_XI.txt --weir-fst-pop pop_admix.txt --weir-fst-pop pop_aus.txt --weir-fst-pop pop_bas.txt --fst-window-size 5000--out combine_base_filtered_indel.fst ###按windows計(jì)算 單位是bp

也可以利用plink 計(jì)算單位點(diǎn)的Fst役衡。

total_pop<-length(fst_pop)

total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all")," --fst --within plink_cluster.txt","  --out ",paste0(out_name,"_","target_all"))
system(total_cmd,intern = FALSE)

###
for(p in fst_pop){
    total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all_",gsub(" ","_",p))," --fst --within plink_cluster.txt  --keep-cluster-names ",p," --out ",paste0(out_name,"_","target_all_",gsub(" ","_",p)))
    system(total_cmd,intern = FALSE)
}

temp_table<-read.table(paste0(out_name,"_","target_all",".fst"),header=T)
fst <- as.data.frame(matrix(nr=nrow(temp_table),nc=total_pop+1))
fst[,1] <- temp_table$FST

for (i in 1:total_pop) {
    temp <- read.table(paste0(out_name,"_","target_all_",gsub(" ","_",fst_pop[i]),".fst"),header=T)
    fst[,i+1] <- temp$FST
}

temp_all <- read.table(paste0(out_name,"_","target_all",".vcf"),header = F,sep="\t")
temp <- cbind(CHR=temp_all$V1,POS=temp_all$V2,fst)

names(temp) <- c("CHR","POS","fst",paste("fst",gsub(" ","_",fst_pop),sep = "_"))
write.table(temp,file=paste0(out_name,"_","target_all",".fst.txt"),sep = "\t",append = FALSE, row.names = FALSE, col.names = T, quote = FALSE) #保存結(jié)果

結(jié)果和vcftools是一樣的

pi

用vcftools計(jì)算可以單位點(diǎn)計(jì)算也可以按windows 計(jì)算
一般認(rèn)為按windows更合理茵休，如果研究目標(biāo)是SNP，還是單點(diǎn)算合理手蝎。
在某個(gè)或多個(gè)群體中計(jì)算就加 --keep pop_GJ.txt --keep pop_XI.txt

vcftools --vcf combine_base_filtered_indel.vcf --site-pi --out combine_base_filtered_indel.pi ###單位點(diǎn)
vcftools --vcf combine_base_filtered_indel.vcf  --window-pi-step 3000  --out combine_base_filtered_indel ###3000個(gè)snp為一個(gè)windows

雙等位的基因 pi值計(jì)算較簡(jiǎn)單榕莺，參考下上面計(jì)算fst的代碼就可以了

total_pop<-length(pop_name)

total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all")," --freq --out ",paste0(out_name,"_","target_all"))
system(total_cmd,intern = FALSE)

###
for(p in pop_name){
    total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all_",gsub(" ","_",p))," --freq --out ",paste0(out_name,"_","target_all_",gsub(" ","_",p)))
    system(total_cmd,intern = FALSE)
}

temp_table<-read.table(paste0(out_name,"_","target_all",".frq"),header=T)
pi <- as.data.frame(matrix(nr=nrow(temp_table),nc=total_pop+1))
pi[,1] <- 2*temp_table$MAF*(1-temp_table$MAF)*temp_table$NCHROBS/(temp_table$NCHROBS-1)

for (j in 1:total_pop) {
    temp <- read.table(paste0(out_name,"_","target_all_",gsub(" ","_",pop_name[j]),".frq"),header=T)
    pi[,j+1] <- 2*temp$MAF*(1-temp$MAF)*temp$NCHROBS/(temp$NCHROBS-1)
}

temp_all <- read.table(paste0(out_name,"_","target_all",".vcf"),header = F,sep="\t")
temp <- cbind(CHR=temp_all$V1,POS=temp_all$V2,pi)

names(temp) <- c("CHR","POS","pi",paste("pi",gsub(" ","_",pop_name),sep = "_"))
write.table(temp,file=paste0(out_name,"_","target_all",".pi"),sep = "\t",append = FALSE, row.names = FALSE, col.names = T, quote = FALSE) #保存結(jié)果

pi結(jié)果比對(duì)

結(jié)果基本相同。

Tajima's D 中性檢測(cè)的指標(biāo)

在某個(gè)或多個(gè)群體中計(jì)算就加 --keep pop_GJ.txt --keep pop_XI.txt

vcftools --vcf combine_base_filtered_indel.vcf --TajimaD 5000 --out combine_base_filtered_indel.TajimaD ###按5000bp計(jì)算

這個(gè)用R計(jì)算有點(diǎn)麻煩棵介，但是用Tassel計(jì)算很方便钉鸯，把vcf輸入就可以了。

Tassel

單倍型分析

這里都拿到SNP數(shù)據(jù)集了鞍时，也就可以進(jìn)行單倍型分析了亏拉，這里用的是Haploview扣蜻，要先裝java環(huán)境（但不支持1.8以上版本）逆巍，有GUI比較友好。輸入格式可以用plink格式(ped/map)莽使。
但Haploview只能做SNP的分析锐极，數(shù)量也不能太多。用前面的步驟生成了ped/info 文件芳肌，按linkage format 輸入就可以了灵再。
這里是利用了admix群體的48個(gè)SNP生成的結(jié)果。

單倍型

但是我想要畫單倍型的地圖亿笤，這個(gè)軟件不太方便翎迁，最后我選擇了用beagle+dnasp+popart來(lái)做單倍型分型和畫圖。
beagle：用做單倍型推斷和填充净薛，需要java環(huán)境汪榔。
dnasp：用作單倍型分型，生成nex文件肃拜。
popart：用作單倍型畫圖痴腌，如果有地理信息雌团，也可以加上去。
由于單倍型分析不允許有缺失士聪，所以需要基因型填充锦援，可以利用上面的vcf文件作為輸入文件，輸出為vcf.gz剥悟，解壓后得到的vcf文件可以到后續(xù)利用灵寺。定相之后vcf里的"/"會(huì)變?yōu)?|"。另一個(gè)原因是3K水稻數(shù)據(jù)中并沒有說(shuō)明bed中的SNP是定相的区岗，這里為了保險(xiǎn)一點(diǎn)替久。

填充后的vcf

在powershell中輸入命令

# gt 是需要的vcf文件，按基因位置分割 out 是輸出文件前綴
java -Xss5m   -jar .\beagle.28Jun21.220.jar gt="C:\\Users\\jinghai\\Desktop\\dailywork\\9-2\\FTIP_gene_target_all.vcf" out=phased

得到定相后的vcf文件然后生成fasta文件躏尉，由于dnasp的輸入fasta文件需要等長(zhǎng),這里對(duì)有indel的地方加了"-"蚯根。每個(gè)樣本能生成兩條單鏈。
這里用R解決

phase_file <- "D:\\beagle\\phased.vcf" #vcf文件位置
out_name <- "C:\\Users\\jinghai\\Desktop\\dailywork\\9-2\\FTIP_gene" #輸出文件前綴
g <- "g" #輸出前綴

vcf <- data.table::fread(phase_file,header = T)

temp <- c()
for (i in 1:nrow(vcf)) {
    temp_value <- vcf[i,10:ncol(vcf)]
    ref <- vcf[i,4]
    alt <- vcf[i,5]
    if(nchar(alt)>nchar(ref)){
    alt <- substr(alt,2,nchar(alt))
    ref <- paste(rep("-",nchar(alt)),collapse = "")
    }
    if(nchar(ref)>nchar(alt)){
        ref <- substr(ref,2,nchar(ref))
        alt <- paste(rep("-",nchar(ref)),collapse = "")
    }
    temp_value[,which(temp_value=="0|0")] <- paste(ref,ref)
    temp_value[,which(temp_value=="0|1")] <- paste(ref,alt)
    temp_value[,which(temp_value=="1|0")] <- paste(alt,ref)
    temp_value[,which(temp_value=="1|1")] <- paste(alt,alt)
    temp <- rbind(temp,temp_value)
}
temp <- as.data.frame(temp)

cat("",file=paste0(out_name,g,".fasta"),append = FALSE)

for (i in 1:ncol(temp)) {
    c1 <- c()
    c2 <- c()
    for (j in 1:nrow(temp)) {
        c1 <- c(c1,strsplit(temp[j,i]," ")[[1]][1])
        c2 <- c(c2,strsplit(temp[j,i]," ")[[1]][2])
    }
    cat(paste(">",paste0(names(temp)[i],".1"),"\n"),file=paste0(out_name,g,".fasta"),append = TRUE)
    cat(paste0(paste(c1,collapse=""),"\n"),file=paste0(out_name,g,".fasta"),append = TRUE)
    cat(paste(">",paste0(names(temp)[i],".2"),"\n"),file=paste0(out_name,g,".fasta"),append = TRUE)
    cat(paste0(paste(c2,collapse=""),"\n"),file=paste0(out_name,g,".fasta"),append = TRUE)
}

fasta

dnasp

輸入fasta后胀糜，Generate 選Haplotype 生成單倍型颅拦，得到單倍型文件(nex)

結(jié)果

這里一共是得到315個(gè)單倍型，但大部分都只有1個(gè)樣本教藻。

由于需要統(tǒng)計(jì)樣本的信息距帅，這里把nex的freq字段另存為test.txt()，用來(lái)做統(tǒng)計(jì)括堤，并生成TraitLabels字段碌秸，這里需要樣本的分類信息plink_cluster.txt。

clust <- read.table("plink_cluster.txt",header = T,sep = "\t")
hap <- read.table("test.txt",header = F,sep = ":")
hap[,1] <- gsub("\\[","",hap[,1])
hap[,2] <- gsub("\\]","",hap[,2])

clut_hap <- clust[,2:3]

tem <- as.data.frame(strsplit(hap[,2],"    "))[2,]
hap[,3] <-t(tem)
hap[,3] <- gsub("\\.1","",hap[,3])
hap[,3] <- gsub("\\.2","",hap[,3])

for (i in 1:nrow(clust)) {
    flag <- 0
    v1 <- 0
    v2 <- 0
    for (j in 1:nrow(hap)) {
        if (flag==2) break
        if(clust[i,1] %in% strsplit(hap[j,3]," ")[[1]]){
            idx <- which(strsplit(hap[j,3]," ")[[1]] %in% clust[i,1])
            if(length(idx)==1&flag==0) {v1 <- hap[j,1];flag <- flag+1;next}
            if(length(idx)==1&flag==1) {v2 <- hap[j,1];flag <- flag+1;next}
            if(length(idx)==2&flag==0) {v1 <- hap[j,1];v2 <- hap[j,1];next}
        }
    }
    clut_hap[i,3] <- v1
    clut_hap[i,4] <- v2
}

write.table(clut_hap,file=paste0(out_name,g,"_clut_hap.txt"),sep = "\t",append = FALSE, row.names = FALSE, col.names = F, quote = FALSE) #保存結(jié)果
names(clut_hap) <- c("ID","group"," "," ")
tem <- rbind(clut_hap[,c(3,2)],clut_hap[,c(4,2)])
tem <- tem[!is.na(tem[,2]),]
temp <- table(tem, exclude = 0)

out_matrix <- data.frame(matrix(temp,nrow = length(row.names(temp))))
row.names(out_matrix) <- row.names(temp) 
names(out_matrix) <- gsub(" ","_",colnames(temp))

out_matrix$idx <- as.numeric(gsub("Hap_","",row.names(out_matrix)))
out_matrix <- out_matrix[order(out_matrix$idx),]
out_matrix <- out_matrix[,-ncol(out_matrix)]

cat("\n",file=paste0(out_name,g,".prenex"),append = FALSE)
cat("BEGIN TRAITS;\n",file=paste0(out_name,g,".prenex"),append = TRUE)
cat(paste0("Dimensions NTRAITS=",ncol(out_matrix),";\n"),file=paste0(out_name,g,".prenex"),append = TRUE)
cat(paste0("Format labels=yes missing=? separator=Tab",";\n"),file=paste0(out_name,g,".prenex"),append = TRUE)
cat(paste0("TraitLabels\t",paste(names(out_matrix),collapse = "\t"),";\n"),file=paste0(out_name,g,".prenex"),append = TRUE)
cat(paste0("Matrix","\n"),file=paste0(out_name,g,".prenex"),append = TRUE)
write.table(out_matrix,file=paste0(out_name,g,".prenex"),sep = "\t",append = TRUE,row.names = TRUE, col.names = F, quote = FALSE)

freq字段

生成文件

把生成的文件全部?jī)?nèi)容貼到dnasp生成nex文件底部悄窃。
但我在用popart的時(shí)后讥电，文件還是導(dǎo)不進(jìn)去。發(fā)現(xiàn)需要在nex文件刪掉一段轧抗。

前

后

然后就用popart可視化就好了恩敌。

315個(gè)單倍型網(wǎng)絡(luò)圖

這個(gè)太多了，手動(dòng)去了下横媚，只保留17個(gè)單倍型(修改TAXA,CHARACTERS,TRAITS三個(gè)字段)纠炮。

17個(gè)單倍型網(wǎng)絡(luò)

也可以把分類信息改成地理信息

單倍型地理分布圖

我發(fā)現(xiàn)popart計(jì)算出的中性突變系數(shù)和Tassel是一樣的。

單倍型統(tǒng)計(jì)

都有單倍型信息了也順便做下方差分析好了灯蝴。
這里用RFGB下載到的GL表型試一下恢口。
正常的關(guān)聯(lián)分析應(yīng)該用的是混合線性模型。

Statistical methods of gcHap-based GWAS

但這里只有一個(gè)基因上的單倍型信息穷躁，用整體的協(xié)方差或親緣關(guān)系矩陣會(huì)把模型變復(fù)雜耕肩，就用簡(jiǎn)單點(diǎn)的方差分析好了。（這方法有待商榷）

pheno <- read.table("grain_length.txt",header = F,sep = "\t") #表型文件
genotype <- read.table(paste0(out_name,g,"_clut_hap.txt"),header = F,sep = "\t") #單倍型文件
thr <- 50 #去掉樣本少的單倍型
tem <- merge(genotype[,c(1,4)],pheno,by=c("V1"))
colnames(tem) <- c("V1","V3","V2")
tem <- rbind(tem, merge(genotype[,c(1,3)],pheno,by=c("V1")))
tem <- merge(tem,genotype[,c(1,2)],by= c("V1"))
colnames(tem) <- c("ID","hap","pheno","cluster")
freq <- as.data.frame(table(tem$hap))
pass_filt <-  as.character(freq[freq[,2]>=thr,1])
data <- tem[which(tem$hap %in% pass_filt),]

library(car)
library(agricolae)
mod <- aov(pheno~factor(hap),data) #方差分析
Anova(mod,type=3)
oneway.test(pheno~factor(hap),data,var.equal = F)#方差不齊的方差分析
out <- scheffe.test(mod,"factor(hap)") #可以用于非平衡數(shù)據(jù)的多重比較
out$groups 
plot(out)

方差分析結(jié)果

簡(jiǎn)單的方差分析可得hap_101的粒長(zhǎng)較長(zhǎng)。

做個(gè)總結(jié)

上述主要是提供了一個(gè)在所有7.1m的SNP數(shù)據(jù)集中看疗，提取部分位點(diǎn)的標(biāo)記沙峻，并計(jì)算LD和一些群體遺傳指標(biāo)，基本上是R+plink就解決了两芳。
當(dāng)然如果手頭上有一些品種的重測(cè)序數(shù)據(jù)摔寨，也可以和3k水稻的SNP數(shù)據(jù)集放到一起進(jìn)行比對(duì)，相信這個(gè)數(shù)據(jù)集還有很多信息沒有被挖掘怖辆，共勉是复。

R代碼匯總

提取位點(diǎn)，群體指標(biāo)計(jì)算

#基本參數(shù)
gene_list <- "C:\\Users\\jinghai\\Desktop\\dailywork\\9-2\\FTIP_gene_len.txt"  #目標(biāo)區(qū)間
out_name <- "C:\\Users\\jinghai\\Desktop\\dailywork\\9-2\\FTIP_gene" #輸出文件前綴
data_name <- c(".\\data\\base_filtered_v0.7",".\\data\\Nipponbare_indel") #SNP數(shù)據(jù)集,輸入多個(gè)就是合并
pop_name<-c("GJ","XI","admix","aus","bas","GJ XI","admix aus bas") #需要生成的群體
fst_pop <- c("GJ XI","admix aus bas") #計(jì)算fst需要的群體 需在pop_name出現(xiàn)過(guò)
#質(zhì)控參數(shù)
maf <- 0.05 
miss <- 0.4

color.rgb <- colorRampPalette(rev(c("white","red")),space="rgb") #color for LD

target_gene <- read.table(gene_list,header = T)
extract <- cbind(target_gene $CHROM,target_gene $start,target_gene $end,target_gene $GENE)
write.table(extract,file="temp_extract.txt",sep = "\t",append = FALSE, row.names = FALSE, col.names = FALSE, quote = FALSE)

###
#提取目標(biāo)區(qū)域位點(diǎn) 并輸出vcf文件 可以用Tassel 看LD
#查看樣本是否缺少
tem <- c()
for (i in 1:length(data_name)) {
    temp <- read.table(paste0(data_name[i],".fam"),header = F)
    temp <- nrow(temp)
    tem <- c(tem,temp)
}
#用少了樣本的數(shù)據(jù)集來(lái)keep
keep_file <- data_name[which(tem %in% min(tem))]

for (i in 1:length(data_name)) {
    total_cmd <- paste0(".//plink --bfile ",data_name[i] ," --extract temp_extract.txt --range --keep ",paste0(keep_file,".fam")," --recode vcf-iid --out ",paste0(out_name,"_",i))
    system(total_cmd,intern = FALSE)
}
temp_all <- read.table(paste0(out_name,"_",1,".vcf"),header = F,sep = '\t')

if (length(data_name)>=2){    #合并數(shù)據(jù)集
    for (i in 2:length(data_name)) {
        temp <- read.table(paste0(out_name,"_",i,".vcf"),header = F,sep = '\t')
        temp_all<-rbind(temp_all ,temp )
    }
    temp_all<-temp_all[order(temp_all[,1],temp_all[,2]),] #按照染色體和POS排序
}

### 列名
fam_name<-read.table(paste0(keep_file,".fam") ,header = F,sep = ' ')[,1]
vcf_colnames<-c("#CHROM",   "POS","ID","REF","ALT","QUAL","FILTER","INFO","FORMAT")
vcf_colnames<-c(vcf_colnames,fam_name)
###輸出結(jié)果竖螃，vcf
names(temp_all)<-vcf_colnames
write.table(temp_all,file=paste0(out_name,"_","target_all.vcf"),sep = "\t",append = FALSE, row.names = FALSE, col.names = T, quote = FALSE)
###篩選
total_cmd <- paste0("plink --vcf ",paste0(out_name,"_","target_all.vcf")," --const-fid --maf ",maf," --geno ",miss," --make-bed --out ",paste0(out_name,"_","target_all")) #直接用vcf篩選會(huì)有點(diǎn)問題淑廊，還是再轉(zhuǎn)下bed格式
system(total_cmd,intern = FALSE)

temp <- read.table(paste0(out_name,"_","target_all",".fam"),header = F)
temp[,1] <- temp[,2]
write.table(temp,file=paste0(out_name,"_","target_all",".fam"),sep = " ",append = FALSE, row.names = FALSE, col.names = F, quote = FALSE)
total_cmd <- paste0(".//plink --bfile ",paste0(out_name,"_","target_all") ," --recode vcf-iid --out ",paste0(out_name,"_","target_all"))
system(total_cmd,intern = FALSE)

#不同群體提取 按基因提取 
for(g in target_gene$GENE){
    extract <-dplyr::filter(target_gene ,GENE==g)
    extract <- cbind(extract$CHROM,extract$start,extract$end,extract$GENE)
    write.table(extract,file="temp_extract.txt",sep = "\t",append = FALSE, row.names = FALSE, col.names = FALSE, quote = FALSE)
    
    total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all")," --extract temp_extract.txt --range --recode vcf-iid --out ",paste0(out_name,"_","target_all_",g))
    system(total_cmd,intern = FALSE)
    
    total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all")," --extract temp_extract.txt --range "," --recode --make-bed --out ",paste0(out_name,"_","target_all_",g))
    system(total_cmd,intern = FALSE)
    
    total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all")," --extract temp_extract.txt --range --recode HV --snps-only just-acgt --out ",paste0(out_name,"_","target_all_",g))
    system(total_cmd,intern = FALSE)

    for (p in pop_name) {
        total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all_",g)," --within plink_cluster.txt  --keep-cluster-names ",p," --recode vcf-iid --out ",paste0(out_name,"_","target_all_",g,"_",gsub(" ","_",p)))
        system(total_cmd,intern = FALSE)
        total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all_",g)," --within plink_cluster.txt  --keep-cluster-names ",p," --recode HV --snps-only just-acgt --out ",paste0(out_name,"_","target_all_",g,"_",gsub(" ","_",p)))
        system(total_cmd,intern = FALSE)
###        total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all_",g)," --within plink_cluster.txt  --keep-cluster-names ",p," --recode --make-bed --out ",paste0(out_name,"_","target_all_",g,"_",gsub(" ","_",p)))
###       system(total_cmd,intern = FALSE)
    }
}

for (p in pop_name) {
    total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all")," --within plink_cluster.txt  --keep-cluster-names ",p,"  --make-bed  --out ",paste0(out_name,"_","target_all_",gsub(" ","_",p)))
    system(total_cmd,intern = FALSE)
}
###
library(LDheatmap)
library(genetics)

plot_LD <- function(vcf_file_name,title){
    temp_vcf <- as.data.frame(data.table::fread(vcf_file_name,header = T,sep = '\t'))
    if(nrow(temp_vcf )<=1) {print(paste0(vcf_file_name," have no snp&indel"));return()}###沒有就結(jié)束

    temp_snp <- temp_vcf[0,]
    for (i in 1:nrow(temp_vcf)) {
        temp <- temp_vcf[i,]
        for(j in 10:ncol(temp_vcf)){
            if(temp_vcf[i,j]=="0/0"){
                temp[1,j] <- paste0(temp[1,4],"/",temp[1,4])
                next
            }
            if(temp_vcf[i,j]=="0/1"){
                temp[1,j] <- paste0(temp[1,4],"/",temp[1,5])
                next
            }
            if(temp_vcf[i,j]=="1/1"){
                temp[1,j] <- paste0(temp[1,5],"/",temp[1,5])
                next
            }
            if(temp_vcf[i,j]=="./."){
                temp[1,j] <- NA
                next
            }
        }
        temp_snp <- rbind(temp_snp,temp)
    }

    SNPpos <- temp_snp$POS
    SNP <- as.data.frame(t(temp_snp[,10:ncol(temp_snp)]))

    ### 轉(zhuǎn)換成LDheatmap能用的格式
    for(i in 1:ncol(SNP)){
        SNP[,i]<-as.genotype(SNP[,i])
    }
    ###畫圖
    LD <- LDheatmap(SNP, SNPpos,flip=TRUE,color=color.rgb(20),title = title,SNP.name=NULL)
}

for(g in target_gene$GENE){
    plot_LD(paste0(out_name,"_","target_all_",g,".vcf"),g)
    for (p in pop_name) {
        plot_LD(paste0(out_name,"_","target_all_",g,"_",gsub(" ","_",p),".vcf"),paste0(g,"_",gsub(" ","_",p)))
    }
}
###
total_pop<-length(fst_pop)

total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all")," --fst --within plink_cluster.txt","  --out ",paste0(out_name,"_","target_all"))
system(total_cmd,intern = FALSE)

###
for(p in fst_pop){
    total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all_",gsub(" ","_",p))," --fst --within plink_cluster.txt  --keep-cluster-names ",p," --out ",paste0(out_name,"_","target_all_",gsub(" ","_",p)))
    system(total_cmd,intern = FALSE)
}

temp_table<-read.table(paste0(out_name,"_","target_all",".fst"),header=T)
fst <- as.data.frame(matrix(nr=nrow(temp_table),nc=total_pop+1))
fst[,1] <- temp_table$FST

for (i in 1:total_pop) {
    temp <- read.table(paste0(out_name,"_","target_all_",gsub(" ","_",fst_pop[i]),".fst"),header=T)
    fst[,i+1] <- temp$FST
}

temp_all <- read.table(paste0(out_name,"_","target_all",".vcf"),header = F,sep="\t")
temp <- cbind(CHR=temp_all$V1,POS=temp_all$V2,fst)

names(temp) <- c("CHR","POS","fst",paste("fst",gsub(" ","_",fst_pop),sep = "_"))
write.table(temp,file=paste0(out_name,"_","target_all",".fst.txt"),sep = "\t",append = FALSE, row.names = FALSE, col.names = T, quote = FALSE) #保存結(jié)果

###
total_pop<-length(pop_name)

total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all")," --freq --out ",paste0(out_name,"_","target_all"))
system(total_cmd,intern = FALSE)

###
for(p in pop_name){
    total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all_",gsub(" ","_",p))," --freq --out ",paste0(out_name,"_","target_all_",gsub(" ","_",p)))
    system(total_cmd,intern = FALSE)
}

temp_table<-read.table(paste0(out_name,"_","target_all",".frq"),header=T)
pi <- as.data.frame(matrix(nr=nrow(temp_table),nc=total_pop+1))
pi[,1] <- 2*temp_table$MAF*(1-temp_table$MAF)*temp_table$NCHROBS/(temp_table$NCHROBS-1)

for (j in 1:total_pop) {
    temp <- read.table(paste0(out_name,"_","target_all_",gsub(" ","_",pop_name[j]),".frq"),header=T)
    pi[,j+1] <- 2*temp$MAF*(1-temp$MAF)*temp$NCHROBS/(temp$NCHROBS-1)
}

temp_all <- read.table(paste0(out_name,"_","target_all",".vcf"),header = F,sep="\t")
temp <- cbind(CHR=temp_all$V1,POS=temp_all$V2,pi)

names(temp) <- c("CHR","POS","pi",paste("pi",gsub(" ","_",pop_name),sep = "_"))
write.table(temp,file=paste0(out_name,"_","target_all",".pi"),sep = "\t",append = FALSE, row.names = FALSE, col.names = T, quote = FALSE) #保存結(jié)果
###

vcf to fasta

phase_file <- "D:\\beagle\\phased.vcf.vcf"
g <- g

vcf <- data.table::fread(phase_file,header = T)

temp <- c()
for (i in 1:nrow(vcf)) {
    temp_value <- vcf[i,10:ncol(vcf)]
    ref <- vcf[i,4]
    alt <- vcf[i,5]
    if(nchar(alt)>nchar(ref)){
    alt <- substr(alt,2,nchar(alt))
    ref <- paste(rep("-",nchar(alt)),collapse = "")
    }
    if(nchar(ref)>nchar(alt)){
        ref <- substr(ref,2,nchar(ref))
        alt <- paste(rep("-",nchar(ref)),collapse = "")
    }
    temp_value[,which(temp_value=="0|0")] <- paste(ref,ref)
    temp_value[,which(temp_value=="0|1")] <- paste(ref,alt)
    temp_value[,which(temp_value=="1|0")] <- paste(alt,ref)
    temp_value[,which(temp_value=="1|1")] <- paste(alt,alt)
    temp <- rbind(temp,temp_value)
}
temp <- as.data.frame(temp)

cat("",file=paste0(out_name,g,".fasta"),append = FALSE)

for (i in 1:ncol(temp)) {
    c1 <- c()
    c2 <- c()
    for (j in 1:nrow(temp)) {
        c1 <- c(c1,strsplit(temp[j,i]," ")[[1]][1])
        c2 <- c(c2,strsplit(temp[j,i]," ")[[1]][2])
    }
    cat(paste(">",paste0(names(temp)[i],".1"),"\n"),file=paste0(out_name,g,".fasta"),append = TRUE)
    cat(paste0(paste(c1,collapse=""),"\n"),file=paste0(out_name,g,".fasta"),append = TRUE)
    cat(paste(">",paste0(names(temp)[i],".2"),"\n"),file=paste0(out_name,g,".fasta"),append = TRUE)
    cat(paste0(paste(c2,collapse=""),"\n"),file=paste0(out_name,g,".fasta"),append = TRUE)
}

nex traits 文件生成

clust <- read.table("plink_cluster.txt",header = T,sep = "\t")
hap <- read.table("test.txt",header = F,sep = ":")
hap[,1] <- gsub("\\[","",hap[,1])
hap[,2] <- gsub("\\]","",hap[,2])

clut_hap <- clust[,2:3]

tem <- as.data.frame(strsplit(hap[,2],"    "))[2,]
hap[,3] <-t(tem)
hap[,3] <- gsub("\\.1","",hap[,3])
hap[,3] <- gsub("\\.2","",hap[,3])

for (i in 1:nrow(clust)) {
    flag <- 0
    v1 <- 0
    v2 <- 0
    for (j in 1:nrow(hap)) {
        if (flag==2) break
        if(clust[i,1] %in% strsplit(hap[j,3]," ")[[1]]){
            idx <- which(strsplit(hap[j,3]," ")[[1]] %in% clust[i,1])
            if(length(idx)==1&flag==0) {v1 <- hap[j,1];flag <- flag+1;next}
            if(length(idx)==1&flag==1) {v2 <- hap[j,1];flag <- flag+1;next}
            if(length(idx)==2&flag==0) {v1 <- hap[j,1];v2 <- hap[j,1];next}
        }
    }
    clut_hap[i,3] <- v1
    clut_hap[i,4] <- v2
}

write.table(clut_hap,file=paste0(out_name,g,"_clut_hap.txt"),sep = "\t",append = FALSE, row.names = FALSE, col.names = F, quote = FALSE) #保存結(jié)果
names(clut_hap) <- c("ID","group"," "," ")
tem <- rbind(clut_hap[,c(3,2)],clut_hap[,c(4,2)])
tem <- tem[!is.na(tem[,2]),]
temp <- table(tem, exclude = 0)

out_matrix <- data.frame(matrix(temp,nrow = length(row.names(temp))))
row.names(out_matrix) <- row.names(temp) 
names(out_matrix) <- gsub(" ","_",colnames(temp))

out_matrix$idx <- as.numeric(gsub("Hap_","",row.names(out_matrix)))
out_matrix <- out_matrix[order(out_matrix$idx),]
out_matrix <- out_matrix[,-ncol(out_matrix)]

cat("\n",file=paste0(out_name,g,".prenex"),append = FALSE)
cat("BEGIN TRAITS;\n",file=paste0(out_name,g,".prenex"),append = TRUE)
cat(paste0("Dimensions NTRAITS=",ncol(out_matrix),";\n"),file=paste0(out_name,g,".prenex"),append = TRUE)
cat(paste0("Format labels=yes missing=? separator=Tab",";\n"),file=paste0(out_name,g,".prenex"),append = TRUE)
cat(paste0("TraitLabels\t",paste(names(out_matrix),collapse = "\t"),";\n"),file=paste0(out_name,g,".prenex"),append = TRUE)
cat(paste0("Matrix","\n"),file=paste0(out_name,g,".prenex"),append = TRUE)
write.table(out_matrix,file=paste0(out_name,g,".prenex"),sep = "\t",append = TRUE,row.names = TRUE, col.names = F, quote = FALSE)

方差分析

pheno <- read.table("grain_length.txt",header = F,sep = "\t")
genotype <- read.table(paste0(out_name,g,"_clut_hap.txt"),header = F,sep = "\t")
thr <- 50
tem <- merge(genotype[,c(1,4)],pheno,by=c("V1"))
colnames(tem) <- c("V1","V3","V2")
tem <- rbind(tem, merge(genotype[,c(1,3)],pheno,by=c("V1")))
tem <- merge(tem,genotype[,c(1,2)],by= c("V1"))
colnames(tem) <- c("ID","hap","pheno","cluster")
freq <- as.data.frame(table(tem$hap))
pass_filt <-  as.character(freq[freq[,2]>=thr,1])
data <- tem[which(tem$hap %in% pass_filt),]

library(car)
library(agricolae)
mod <- aov(pheno~factor(hap),data)
Anova(mod,type=3)
oneway.test(pheno~factor(hap),data,var.equal = F)
out <- scheffe.test(mod,"factor(hap)") #可以用于非平衡數(shù)據(jù)的多重比較
out$groups
plot(out)

吐槽一下，簡(jiǎn)書的這個(gè)界面太不好寫了特咆，也不好看季惩。代碼中難免會(huì)有多個(gè)空格，逗號(hào)不對(duì)啥的腻格，請(qǐng)見諒画拾。

最后編輯于：2021.10.10 22:29:23

?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請(qǐng)聯(lián)系作者

人面猴
序言：七十年代末，一起剝皮案震驚了整個(gè)濱河市菜职，隨后出現(xiàn)的幾起案子青抛，更是在濱河造成了極大的恐慌，老刑警劉巖酬核，帶你破解...
沈念sama閱讀 216,324評(píng)論 6贊 498
死咒
序言：濱河連續(xù)發(fā)生了三起死亡事件蜜另，死亡現(xiàn)場(chǎng)離奇詭異，居然都是意外死亡嫡意，警方通過(guò)查閱死者的電腦和手機(jī)举瑰，發(fā)現(xiàn)死者居然都...
沈念sama閱讀 92,356評(píng)論 3贊 392
救了他兩次的神仙讓他今天三更去死
文/潘曉璐我一進(jìn)店門，熙熙樓的掌柜王于貴愁眉苦臉地迎上來(lái)鹅很，“玉大人嘶居，你說(shuō)我怎么就攤上這事〈僦螅” “怎么了？”我有些...
開封第一講書人閱讀 162,328評(píng)論 0贊 353
道士緝兇錄：失蹤的賣姜人
文/不壞的土叔我叫張陵整袁，是天一觀的道長(zhǎng)菠齿。經(jīng)常有香客問我，道長(zhǎng)坐昙，這世上最難降的妖魔是什么绳匀？我笑而不...
開封第一講書人閱讀 58,147評(píng)論 1贊 292
?港島之戀（遺憾婚禮）
正文為了忘掉前任，我火速辦了婚禮，結(jié)果婚禮上疾棵，老公的妹妹穿的比我還像新娘戈钢。我一直安慰自己，他們只是感情好是尔，可當(dāng)我...
茶點(diǎn)故事閱讀 67,160評(píng)論 6贊 388
惡毒庶女頂嫁案：這布局不是一般人想出來(lái)的
文/花漫我一把揭開白布殉了。她就那樣靜靜地躺著，像睡著了一般拟枚。火紅的嫁衣襯著肌膚如雪薪铜。梳的紋絲不亂的頭發(fā)上，一...
開封第一講書人閱讀 51,115評(píng)論 1贊 296
城市分裂傳說(shuō)
那天恩溅，我揣著相機(jī)與錄音隔箍，去河邊找鬼。笑死脚乡，一個(gè)胖子當(dāng)著我的面吹牛蜒滩，可吹牛的內(nèi)容都是我干的。我是一名探鬼主播奶稠，決...
沈念sama閱讀 40,025評(píng)論 3贊 417
雙鴛鴦連環(huán)套：你想象不到人心有多黑
文/蒼蘭香墨我猛地睜開眼帮掉，長(zhǎng)吁一口氣：“原來(lái)是場(chǎng)噩夢(mèng)啊……” “哼！你這毒婦竟也來(lái)了窒典？” 一聲冷哼從身側(cè)響起蟆炊，我...
開封第一講書人閱讀 38,867評(píng)論 0贊 274
萬(wàn)榮殺人案實(shí)錄
序言：老撾萬(wàn)榮一對(duì)情侶失蹤，失蹤者是張志新（化名）和其女友劉穎瀑志，沒想到半個(gè)月后涩搓，有當(dāng)?shù)厝嗽跇淞掷锇l(fā)現(xiàn)了一具尸體，經(jīng)...
沈念sama閱讀 45,307評(píng)論 1贊 310
?護(hù)林員之死
正文獨(dú)居荒郊野嶺守林人離奇死亡劈猪，尸身上長(zhǎng)有42處帶血的膿包…… 初始之章·張勛以下內(nèi)容為張勛視角年9月15日...
茶點(diǎn)故事閱讀 37,528評(píng)論 2贊 332
?白月光啟示錄
正文我和宋清朗相戀三年昧甘，在試婚紗的時(shí)候發(fā)現(xiàn)自己被綠了。大學(xué)時(shí)的朋友給我發(fā)了我未婚夫和他白月光在一起吃飯的照片战得。...
茶點(diǎn)故事閱讀 39,688評(píng)論 1贊 348
活死人
序言：一個(gè)原本活蹦亂跳的男人離奇死亡充边，死狀恐怖，靈堂內(nèi)的尸體忽然破棺而出常侦，到底是詐尸還是另有隱情浇冰，我是刑警寧澤，帶...
沈念sama閱讀 35,409評(píng)論 5贊 343
?日本核電站爆炸內(nèi)幕
正文年R本政府宣布聋亡，位于F島的核電站肘习，受9級(jí)特大地震影響，放射性物質(zhì)發(fā)生泄漏坡倔。R本人自食惡果不足惜漂佩，卻給世界環(huán)境...
茶點(diǎn)故事閱讀 41,001評(píng)論 3贊 325
男人毒藥：我在死后第九天來(lái)索命
文/蒙蒙一脖含、第九天我趴在偏房一處隱蔽的房頂上張望。院中可真熱鬧投蝉，春花似錦养葵、人聲如沸。這莊子的主人今日做“春日...
開封第一講書人閱讀 31,657評(píng)論 0贊 22
一樁弒父案关拒，背后竟有這般陰謀
文/蒼蘭香墨我抬頭看了看天上的太陽(yáng)。三九已至咳榜，卻和暖如春夏醉，著一層夾襖步出監(jiān)牢的瞬間，已是汗流浹背涌韩。一陣腳步聲響...
開封第一講書人閱讀 32,811評(píng)論 1贊 268
情欲美人皮
我被黑心中介騙來(lái)泰國(guó)打工畔柔，沒想到剛下飛機(jī)就差點(diǎn)兒被人妖公主榨干…… 1. 我叫王不留，地道東北人臣樱。一個(gè)月前我還...
沈念sama閱讀 47,685評(píng)論 2贊 368
代替公主和親
正文我出身青樓靶擦，卻偏偏與公主長(zhǎng)得像，于是被迫代替她去往敵國(guó)和親雇毫。傳聞我的和親對(duì)象是個(gè)殘疾皇子玄捕，可洞房花燭夜當(dāng)晚...
茶點(diǎn)故事閱讀 44,573評(píng)論 2贊 353

3K水稻SNP數(shù)據(jù)集的簡(jiǎn)單利用

放在前面

工具準(zhǔn)備

數(shù)據(jù)下載

準(zhǔn)備目標(biāo)基因的位置

合并想要的SNP數(shù)據(jù)集&indel數(shù)據(jù)集

準(zhǔn)備分類文件

目標(biāo)區(qū)域SNP位點(diǎn)提取

畫LD heatmap

計(jì)算群體遺傳性指標(biāo)

Fst

pi

Tajima's D 中性檢測(cè)的指標(biāo)

單倍型分析

做個(gè)總結(jié)

R代碼匯總

推薦閱讀更多精彩內(nèi)容