3010份亞洲稻群體重測(cè)序項(xiàng)目是由中國(guó)農(nóng)業(yè)科學(xué)院作物科學(xué)研究所牽頭误褪,聯(lián)合國(guó)際水稻研究所春叫、華大基因等16家單位共同完成肩钠。該研究對(duì)來(lái)自89個(gè)國(guó)家的3,010份水稻品種進(jìn)行了重測(cè)序研究(resequence)泣港,參考的是日本晴(Nipponbare)基因組。所有3,010個(gè)基因組的平均測(cè)序深度(average sequencing depth)為14×,平均基因組覆蓋率(average genome coverage)和繪圖率(average mapping rate)分別為94.0%和92.5%价匠。
該項(xiàng)目的意義自然不用多說(shuō)当纱,并且該項(xiàng)目的測(cè)序數(shù)據(jù)都可以在國(guó)際水稻所或CNGB上下載。我們可以從原始序列開始組裝得到大的結(jié)構(gòu)變異 或call到一些稀有SNP來(lái)做研究踩窖。當(dāng)然以上想法不在本文討論范圍內(nèi)坡氯,本文主要討論只想知道幾個(gè)基因區(qū)域上 SNP 及其分化系數(shù)等群體指標(biāo)的計(jì)算。
2021.09.15 更新了單倍型分型洋腮、作圖和方差分析箫柳。
放在前面
RFGB可以極大程度上滿足查找某個(gè)基因上的SNP&indel和單倍型查找。并提供15個(gè)表型可以進(jìn)行單倍型分析啥供,也可以上傳表型進(jìn)行分析悯恍。接下來(lái)探索一下。
這個(gè)頁(yè)面提供單倍型分析伙狐。分別可以用基因名涮毫,染色體位置和特定的SNP來(lái)進(jìn)行分析。提供MAF是最小等位基因頻率贷屎,建議選一下罢防。可以用全部的3024個(gè)品種進(jìn)行查找也可以分亞群進(jìn)行查找,也提供了表型關(guān)聯(lián)分析唉侄。
這里是用的LOC_Os08g33530 CDS 區(qū)域咒吐,選的秈稻群體,共找到20個(gè)SNP属划,和16個(gè)單倍型恬叹。其中hap11的谷粒長(zhǎng)最高。
這個(gè)網(wǎng)站用來(lái)做單倍型都是SNP榴嗅,而且是復(fù)等位的妄呕,就是沒找到在哪下數(shù)據(jù)集。
可惜沒有LD的圖和pi值嗽测。
但我找到了水稻單倍型的數(shù)據(jù)绪励,具體可看文章
The landscape of gene–CDS–haplotype diversity in rice: Properties, population organization, footprints of domestication and breeding, and implications for genetic improvement
數(shù)據(jù)地址
也可以用這個(gè)單倍型數(shù)據(jù)和表型做關(guān)聯(lián)分析,用方差分析就可以唠粥。表型數(shù)據(jù)下載
以下主要是3k構(gòu)建的SNP的探索以及簡(jiǎn)單利用疏魏。
工具準(zhǔn)備
R
R package(LDheatmap,genetics)
plink1.9(3個(gè)操作系統(tǒng)版本都有)下載地址
vcftools(只有l(wèi)iunx版本,只是用來(lái)算群體遺傳統(tǒng)計(jì))下載地址
數(shù)據(jù)下載
可以在IRRI上下載到雙等位SNP&indel數(shù)據(jù)(多等位的SNP貌似只能從頭call晤愧,沒找到下載的地方)大莫。bed、bim官份、fam文件為一組只厘,bed 是存放位點(diǎn)信息的二進(jìn)制格式文件,用文本打開會(huì)發(fā)現(xiàn)是亂碼烙丛;bim是存放位點(diǎn)信息的文件,類似于map格式羔味;fam是存放樣本信息的文件河咽,第一列為FID,第二列為IID。
Nipponbare_indel 是雙等位indel數(shù)據(jù)集赋元,共有2.3m個(gè)忘蟹,但是是沒有經(jīng)過(guò)篩選的,后續(xù)需要加工下搁凸。
雙等位的SNP標(biāo)記共5個(gè)媚值。
NB_final_snp 包含了所有的SNP,共29m個(gè)护糖,是沒有經(jīng)過(guò)篩選的褥芒。
Base 的SNP是經(jīng)過(guò)基礎(chǔ)篩選的,共18m個(gè)椅文。
base_filtered_v0.7 是從Base的SNP進(jìn)一步篩選的喂很,共4.2m個(gè)惜颇,本文主要利用這個(gè)數(shù)據(jù)集皆刺。
pruned_v2.1 是2.1版本的篩選方案,包含1m個(gè)SNP凌摄。
base_filtered_v0.7_0.8_10kb_1_0.8_50_1 是在base_filtered_v0.7 的基礎(chǔ)上進(jìn)一步篩選 共 404k 個(gè)羡蛾,在用的時(shí)候發(fā)現(xiàn)很多基因區(qū)域上都沒有SNP,畢竟個(gè)數(shù)少锨亏,比較稀疏痴怨。
各個(gè)數(shù)據(jù)集的篩選方法可以看看里邊的readme文件,不同的篩選方案主要目的就是提高精度器予,找到因果變異浪藻。
準(zhǔn)備目標(biāo)基因的位置
其實(shí)只需要染色體號(hào)和起始位置就好了,基因名字寫啥都行乾翔,就是拿來(lái)做圖片標(biāo)題的爱葵。忘記提了,這個(gè)數(shù)據(jù)集中沒有線粒體和葉綠體上的標(biāo)記反浓。建議起始位置可以把上游和下游都囊括進(jìn)去萌丈,多個(gè)2kb,3kb 都是可以的雷则,多找?guī)讉€(gè)總是好的辆雾,萬(wàn)一和表型關(guān)聯(lián)上了呢。合并想要的SNP數(shù)據(jù)集&indel數(shù)據(jù)集
其實(shí)本可以一個(gè)個(gè)數(shù)據(jù)集來(lái)度迂,但我想把2.3m個(gè)indel和4.2m個(gè)SNP標(biāo)記合在一起分析咋辦呢藤乙。最簡(jiǎn)單的辦法還是用vcftools或bcftools合并兩個(gè)數(shù)據(jù)集〔涯梗可以參考這篇湾盒。bash代碼如下:
plink --bfile base_filtered_v0.7 --recode vcf-iid --out base_filtered_v0.7 #先轉(zhuǎn)到vcf格式
plink --bfile Nipponbare_indel --recode vcf-iid --out Nipponbare_indel
###由于Nipponbare_indel 中少一個(gè)樣本 IRIS_313-8921 并且這個(gè)樣本在官網(wǎng)給的分類為NA,空值诅妹。所以就需要?jiǎng)h掉它
vcftools --vcf base_filtered_v0.7.vcf --recode --recode-INFO-all --stdout --indv IRIS_313-8921 > rm_na_base_filtered_v0.7.vcf
bcftools view Nipponbare_indel.vcf -Oz -o Nipponbare_indel.vcf.gz #構(gòu)建引索
bcftools index Nipponbare_indel.vcf.gz
bcftools view rm_na_base_filtered_v0.7.vcf -Oz -o rm_na_base_filtered_v0.7.vcf.gz
bcftools index rm_na_base_filtered_v0.7.vcf.gz
bcftools concat rm_na_base_filtered_v0.7.vcf.gz Nipponbare_indel.vcf.gz -a -O v -o combine_base_filtered_indel.vcf #合并兩個(gè)vcf文件
plink --vcf combine_base_filtered_indel.vcf --make-bed --out combine_base_filtered_indel #再轉(zhuǎn)回bed格式 主要是這樣占內(nèi)存小點(diǎn)罚勾,這個(gè)vcf文件有55G ,bed文件只有5.23G
那么就不想用vcftools合并兩個(gè)數(shù)據(jù)集咋辦呢吭狡?有些時(shí)候并不想得到所有位點(diǎn)的信息尖殃,只需要一部分就可以了。其實(shí)可以用R和plink來(lái)實(shí)現(xiàn)合并兩個(gè)數(shù)據(jù)集划煮,并只提取幾個(gè)位點(diǎn)送丰,這種方案內(nèi)存代價(jià)小點(diǎn),且看后續(xù)講解弛秋。
準(zhǔn)備分類文件
群體分類文件如下這個(gè)分類太細(xì)了器躏,給它分成五類就可以了,GJ蟹略、XI登失、admix、aus挖炬、bas揽浙,用excel的分列就能完成,需要把IRIS_313-8921去掉意敛,這個(gè)分類是NA馅巷。當(dāng)然也可以用自定義的分類。
突然想起踩到的一個(gè)坑钓猬,plink1.9和2在篩選的時(shí)候無(wú)法識(shí)別樣本名稱帶有"-"字符,所以需要把 分類文件里 "-"換成"_" ,"IRIS_313-8921"換成"IRIS_313_8921"如用 excel 就可以完成撩独,對(duì)了fam文件里也要換敞曹。
目標(biāo)區(qū)域SNP位點(diǎn)提取
首先需要把plink.exe和分類文件(plink_cluster.txt)放到工作目錄下,方便調(diào)用跌榔,也可以把它放到環(huán)境目錄下异雁。
R代碼如下:
#基本參數(shù)
gene_list <- "C:\\Users\\jinghai\\Desktop\\dailywork\\9-2\\FTIP_gene_len.txt" #目標(biāo)區(qū)間
out_name <- "C:\\Users\\jinghai\\Desktop\\dailywork\\9-2\\FTIP_gene" #輸出文件前綴
data_name <- c(".\\data\\base_filtered_v0.7",".\\data\\Nipponbare_indel") #SNP數(shù)據(jù)集,輸入多個(gè)就是合并
pop_name<-c("GJ","XI","admix","aus","bas","GJ XI","admix aus bas") #需要生成的群體
fst_pop <- c("GJ XI","admix aus bas") #計(jì)算fst需要的群體 需在pop_name出現(xiàn)過(guò)
#質(zhì)控參數(shù)
maf <- 0.05
miss <- 0.4
color.rgb <- colorRampPalette(rev(c("white","red")),space="rgb") #color for LD
target_gene <- read.table(gene_list,header = T)
extract <- cbind(target_gene $CHROM,target_gene $start,target_gene $end,target_gene $GENE)
write.table(extract,file="temp_extract.txt",sep = "\t",append = FALSE, row.names = FALSE, col.names = FALSE, quote = FALSE)
#提取目標(biāo)區(qū)域位點(diǎn) 并輸出vcf文件 可以用Tassel 看LD
#查看樣本是否缺少
tem <- c()
for (i in 1:length(data_name)) {
temp <- read.table(paste0(data_name[i],".fam"),header = F)
temp <- nrow(temp)
tem <- c(tem,temp)
}
#用少了樣本的數(shù)據(jù)集來(lái)keep
keep_file <- data_name[which(tem %in% min(tem))]
for (i in 1:length(data_name)) {
total_cmd <- paste0(".//plink --bfile ",data_name[i] ," --extract temp_extract.txt --range --keep ",paste0(keep_file,".fam")," --recode vcf-iid --out ",paste0(out_name,"_",i))
system(total_cmd,intern = FALSE)
}
temp_all <- read.table(paste0(out_name,"_",1,".vcf"),header = F,sep = '\t')
if (length(data_name)>=2){ #合并數(shù)據(jù)集
for (i in 2:length(data_name)) {
temp <- read.table(paste0(out_name,"_",i,".vcf"),header = F,sep = '\t')
temp_all<-rbind(temp_all ,temp )
}
temp_all<-temp_all[order(temp_all[,1],temp_all[,2]),] #按照染色體和POS排序
}
### 列名
fam_name<-read.table(paste0(keep_file,".fam") ,header = F,sep = ' ')[,1]
vcf_colnames<-c("#CHROM", "POS","ID","REF","ALT","QUAL","FILTER","INFO","FORMAT")
vcf_colnames<-c(vcf_colnames,fam_name)
###輸出結(jié)果,vcf
names(temp_all)<-vcf_colnames
write.table(temp_all,file=paste0(out_name,"_","target_all.vcf"),sep = "\t",append = FALSE, row.names = FALSE, col.names = T, quote = FALSE)
###篩選
total_cmd <- paste0("plink --vcf ",paste0(out_name,"_","target_all.vcf")," --const-fid --maf ",maf," --geno ",miss," --make-bed --out ",paste0(out_name,"_","target_all")) #直接用vcf篩選會(huì)有點(diǎn)問題僧须,還是再轉(zhuǎn)下bed格式
system(total_cmd,intern = FALSE)
temp <- read.table(paste0(out_name,"_","target_all",".fam"),header = F)
temp[,1] <- temp[,2]
write.table(temp,file=paste0(out_name,"_","target_all",".fam"),sep = " ",append = FALSE, row.names = FALSE, col.names = F, quote = FALSE)
total_cmd <- paste0(".//plink --bfile ",paste0(out_name,"_","target_all") ," --recode vcf-iid --out ",paste0(out_name,"_","target_all"))
system(total_cmd,intern = FALSE)
#不同群體提取 按基因提取
for(g in target_gene$GENE){
extract <-dplyr::filter(target_gene ,GENE==g)
extract <- cbind(extract$CHROM,extract$start,extract$end,extract$GENE)
write.table(extract,file="temp_extract.txt",sep = "\t",append = FALSE, row.names = FALSE, col.names = FALSE, quote = FALSE)
total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all")," --extract temp_extract.txt --range --recode vcf-iid --out ",paste0(out_name,"_","target_all_",g))
system(total_cmd,intern = FALSE)
total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all")," --extract temp_extract.txt --range "," --make-bed --out ",paste0(out_name,"_","target_all_",g))
system(total_cmd,intern = FALSE)
total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all")," --extract temp_extract.txt --range --recode HV --snps-only just-acgt --out ",paste0(out_name,"_","target_all_",g))
system(total_cmd,intern = FALSE)
for (p in pop_name) {
total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all_",g)," --within plink_cluster.txt --keep-cluster-names ",p," --recode vcf-iid --out ",paste0(out_name,"_","target_all_",g,"_",gsub(" ","_",p)))
system(total_cmd,intern = FALSE)
total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all_",g)," --within plink_cluster.txt --keep-cluster-names ",p," --recode HV --snps-only just-acgt --out ",paste0(out_name,"_","target_all_",g,"_",gsub(" ","_",p)))
system(total_cmd,intern = FALSE)
### total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all_",g)," --within plink_cluster.txt --keep-cluster-names ",p," --make-bed --snps-only just-acgt --out ",paste0(out_name,"_","target_all_",g,"_",gsub(" ","_",p)))
### system(total_cmd,intern = FALSE)
}
}
for (p in pop_name) {
total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all")," --within plink_cluster.txt --keep-cluster-names ",p," --make-bed --out ",paste0(out_name,"_","target_all_",gsub(" ","_",p)))
system(total_cmd,intern = FALSE)
}
畫LD heatmap
畫LD用Tassel就可以話巍耗,上述代碼也生成了vcf格式文件可以直接在Tassel中打開扣甲。這里用下LDheatmap但绕。LDheatmap具體的教程比較多,自由度也比較高锭部,可自行找些教程。當(dāng)然也有在線能畫LD圖的面褐,就是有點(diǎn)麻煩拌禾。這里具體就是如何將vcf或bed格式文件轉(zhuǎn)到LDheatmap可用的格式。R代碼如下:
library(LDheatmap)
library(genetics)
plot_LD <- function(vcf_file_name,title){
temp_vcf <- as.data.frame(data.table::fread(vcf_file_name,header = T,sep = '\t'))
if(nrow(temp_vcf )<=1) {print(paste0(vcf_file_name," have no snp&indel"));return()}###沒有就結(jié)束
temp_snp <- temp_vcf[0,]
for (i in 1:nrow(temp_vcf)) {
temp <- temp_vcf[i,]
for(j in 10:ncol(temp_vcf)){
if(temp_vcf[i,j]=="0/0"){
temp[1,j] <- paste0(temp[1,4],"/",temp[1,4])
next
}
if(temp_vcf[i,j]=="0/1"){
temp[1,j] <- paste0(temp[1,4],"/",temp[1,5])
next
}
if(temp_vcf[i,j]=="1/1"){
temp[1,j] <- paste0(temp[1,5],"/",temp[1,5])
next
}
if(temp_vcf[i,j]=="./."){
temp[1,j] <- NA
next
}
}
temp_snp <- rbind(temp_snp,temp)
}
SNPpos <- temp_snp$POS
SNP <- as.data.frame(t(temp_snp[,10:ncol(temp_snp)]))
### 轉(zhuǎn)換成LDheatmap能用的格式
for(i in 1:ncol(SNP)){
SNP[,i]<-as.genotype(SNP[,i])
}
###畫圖
LD <- LDheatmap(SNP, SNPpos,flip=TRUE,color=color.rgb(20),title = title,SNP.name=NULL)
}
for(g in target_gene$GENE){
plot_LD(paste0(out_name,"_","target_all_",g,".vcf"),g)
for (p in pop_name) {
plot_LD(paste0(out_name,"_","target_all_",g,"_",gsub(" ","_",p),".vcf"),paste0(g,"_",gsub(" ","_",p)))
}
}
計(jì)算群體遺傳性指標(biāo)
首先推薦用vcftools來(lái)計(jì)算展哭,因?yàn)榉奖闩惹稀5碚撋现拦接胑xcel也可以算。
計(jì)算Fst,Pi,Tajimi'D based on SNP vcf file
Fst詳解(具體計(jì)算步驟)
【群體遺傳學(xué)】 π (pi)的計(jì)算
Fst
需要分成至少兩個(gè)群體 每個(gè)群體一個(gè)文件匪傍,可以放所有的5個(gè)分類群體您市,也可以放幾個(gè)感興趣的群體。bash代碼
vcftools --vcf combine_base_filtered_indel.vcf --weir-fst-pop pop_GJ.txt --weir-fst-pop pop_XI.txt --weir-fst-pop pop_admix.txt --weir-fst-pop pop_aus.txt --weir-fst-pop pop_bas.txt --out combine_base_filtered_indel.fst ###單位點(diǎn)計(jì)算
vcftools --vcf combine_base_filtered_indel.vcf --weir-fst-pop pop_GJ.txt --weir-fst-pop pop_XI.txt --weir-fst-pop pop_admix.txt --weir-fst-pop pop_aus.txt --weir-fst-pop pop_bas.txt --fst-window-size 5000--out combine_base_filtered_indel.fst ###按windows計(jì)算 單位是bp
也可以利用plink 計(jì)算單位點(diǎn)的Fst役衡。
total_pop<-length(fst_pop)
total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all")," --fst --within plink_cluster.txt"," --out ",paste0(out_name,"_","target_all"))
system(total_cmd,intern = FALSE)
###
for(p in fst_pop){
total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all_",gsub(" ","_",p))," --fst --within plink_cluster.txt --keep-cluster-names ",p," --out ",paste0(out_name,"_","target_all_",gsub(" ","_",p)))
system(total_cmd,intern = FALSE)
}
temp_table<-read.table(paste0(out_name,"_","target_all",".fst"),header=T)
fst <- as.data.frame(matrix(nr=nrow(temp_table),nc=total_pop+1))
fst[,1] <- temp_table$FST
for (i in 1:total_pop) {
temp <- read.table(paste0(out_name,"_","target_all_",gsub(" ","_",fst_pop[i]),".fst"),header=T)
fst[,i+1] <- temp$FST
}
temp_all <- read.table(paste0(out_name,"_","target_all",".vcf"),header = F,sep="\t")
temp <- cbind(CHR=temp_all$V1,POS=temp_all$V2,fst)
names(temp) <- c("CHR","POS","fst",paste("fst",gsub(" ","_",fst_pop),sep = "_"))
write.table(temp,file=paste0(out_name,"_","target_all",".fst.txt"),sep = "\t",append = FALSE, row.names = FALSE, col.names = T, quote = FALSE) #保存結(jié)果
結(jié)果和vcftools是一樣的
pi
用vcftools計(jì)算 可以單位點(diǎn)計(jì)算 也可以按windows 計(jì)算
一般認(rèn)為按windows更合理茵休,如果研究目標(biāo)是SNP,還是單點(diǎn)算合理手蝎。
在某個(gè)或多個(gè)群體中計(jì)算就加 --keep pop_GJ.txt --keep pop_XI.txt
vcftools --vcf combine_base_filtered_indel.vcf --site-pi --out combine_base_filtered_indel.pi ###單位點(diǎn)
vcftools --vcf combine_base_filtered_indel.vcf --window-pi-step 3000 --out combine_base_filtered_indel ###3000個(gè)snp為一個(gè)windows
雙等位的基因 pi值計(jì)算較簡(jiǎn)單榕莺,參考下上面計(jì)算fst的代碼就可以了
total_pop<-length(pop_name)
total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all")," --freq --out ",paste0(out_name,"_","target_all"))
system(total_cmd,intern = FALSE)
###
for(p in pop_name){
total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all_",gsub(" ","_",p))," --freq --out ",paste0(out_name,"_","target_all_",gsub(" ","_",p)))
system(total_cmd,intern = FALSE)
}
temp_table<-read.table(paste0(out_name,"_","target_all",".frq"),header=T)
pi <- as.data.frame(matrix(nr=nrow(temp_table),nc=total_pop+1))
pi[,1] <- 2*temp_table$MAF*(1-temp_table$MAF)*temp_table$NCHROBS/(temp_table$NCHROBS-1)
for (j in 1:total_pop) {
temp <- read.table(paste0(out_name,"_","target_all_",gsub(" ","_",pop_name[j]),".frq"),header=T)
pi[,j+1] <- 2*temp$MAF*(1-temp$MAF)*temp$NCHROBS/(temp$NCHROBS-1)
}
temp_all <- read.table(paste0(out_name,"_","target_all",".vcf"),header = F,sep="\t")
temp <- cbind(CHR=temp_all$V1,POS=temp_all$V2,pi)
names(temp) <- c("CHR","POS","pi",paste("pi",gsub(" ","_",pop_name),sep = "_"))
write.table(temp,file=paste0(out_name,"_","target_all",".pi"),sep = "\t",append = FALSE, row.names = FALSE, col.names = T, quote = FALSE) #保存結(jié)果
結(jié)果基本相同。
Tajima's D 中性檢測(cè)的指標(biāo)
在某個(gè)或多個(gè)群體中計(jì)算就加 --keep pop_GJ.txt --keep pop_XI.txt
vcftools --vcf combine_base_filtered_indel.vcf --TajimaD 5000 --out combine_base_filtered_indel.TajimaD ###按5000bp計(jì)算
這個(gè)用R計(jì)算有點(diǎn)麻煩棵介,但是用Tassel計(jì)算很方便钉鸯,把vcf輸入就可以了。
單倍型分析
這里都拿到SNP數(shù)據(jù)集了鞍时,也就可以進(jìn)行單倍型分析了亏拉,這里用的是Haploview扣蜻,要先裝java環(huán)境(但不支持1.8以上版本)逆巍,有GUI比較友好。輸入格式可以用plink格式(ped/map)莽使。
但Haploview只能做SNP的分析锐极,數(shù)量也不能太多。用前面的步驟生成了ped/info 文件芳肌,按linkage format 輸入就可以了灵再。
這里是利用了admix群體的48個(gè)SNP生成的結(jié)果。
但是我想要畫單倍型的地圖亿笤,這個(gè)軟件不太方便翎迁,最后我選擇了用beagle+dnasp+popart來(lái)做單倍型分型和畫圖。
beagle:用做單倍型推斷和填充净薛,需要java環(huán)境汪榔。
dnasp:用作單倍型分型,生成nex文件肃拜。
popart:用作單倍型畫圖痴腌,如果有地理信息雌团,也可以加上去。
由于單倍型分析不允許有缺失士聪,所以需要基因型填充锦援,可以利用上面的vcf文件作為輸入文件,輸出為vcf.gz剥悟,解壓后得到的vcf文件可以到后續(xù)利用灵寺。定相之后vcf里的"/"會(huì)變?yōu)?|"。另一個(gè)原因是3K水稻數(shù)據(jù)中并沒有說(shuō)明bed中的SNP是定相的区岗,這里為了保險(xiǎn)一點(diǎn)替久。
在powershell中輸入命令
# gt 是需要的vcf文件,按基因位置分割 out 是輸出文件前綴
java -Xss5m -jar .\beagle.28Jun21.220.jar gt="C:\\Users\\jinghai\\Desktop\\dailywork\\9-2\\FTIP_gene_target_all.vcf" out=phased
得到定相后的vcf文件然后生成fasta文件躏尉,由于dnasp的輸入fasta文件需要等長(zhǎng),這里對(duì)有indel的地方加了"-"蚯根。每個(gè)樣本能生成兩條單鏈。
這里用R解決
phase_file <- "D:\\beagle\\phased.vcf" #vcf文件位置
out_name <- "C:\\Users\\jinghai\\Desktop\\dailywork\\9-2\\FTIP_gene" #輸出文件前綴
g <- "g" #輸出前綴
vcf <- data.table::fread(phase_file,header = T)
temp <- c()
for (i in 1:nrow(vcf)) {
temp_value <- vcf[i,10:ncol(vcf)]
ref <- vcf[i,4]
alt <- vcf[i,5]
if(nchar(alt)>nchar(ref)){
alt <- substr(alt,2,nchar(alt))
ref <- paste(rep("-",nchar(alt)),collapse = "")
}
if(nchar(ref)>nchar(alt)){
ref <- substr(ref,2,nchar(ref))
alt <- paste(rep("-",nchar(ref)),collapse = "")
}
temp_value[,which(temp_value=="0|0")] <- paste(ref,ref)
temp_value[,which(temp_value=="0|1")] <- paste(ref,alt)
temp_value[,which(temp_value=="1|0")] <- paste(alt,ref)
temp_value[,which(temp_value=="1|1")] <- paste(alt,alt)
temp <- rbind(temp,temp_value)
}
temp <- as.data.frame(temp)
cat("",file=paste0(out_name,g,".fasta"),append = FALSE)
for (i in 1:ncol(temp)) {
c1 <- c()
c2 <- c()
for (j in 1:nrow(temp)) {
c1 <- c(c1,strsplit(temp[j,i]," ")[[1]][1])
c2 <- c(c2,strsplit(temp[j,i]," ")[[1]][2])
}
cat(paste(">",paste0(names(temp)[i],".1"),"\n"),file=paste0(out_name,g,".fasta"),append = TRUE)
cat(paste0(paste(c1,collapse=""),"\n"),file=paste0(out_name,g,".fasta"),append = TRUE)
cat(paste(">",paste0(names(temp)[i],".2"),"\n"),file=paste0(out_name,g,".fasta"),append = TRUE)
cat(paste0(paste(c2,collapse=""),"\n"),file=paste0(out_name,g,".fasta"),append = TRUE)
}
輸入fasta后胀糜,Generate 選Haplotype 生成單倍型颅拦,得到單倍型文件(nex)
這里一共是得到315個(gè)單倍型,但大部分都只有1個(gè)樣本教藻。
由于需要統(tǒng)計(jì)樣本的信息距帅,這里把nex的freq字段另存為test.txt(),用來(lái)做統(tǒng)計(jì)括堤,并生成TraitLabels字段碌秸,這里需要樣本的分類信息plink_cluster.txt。
clust <- read.table("plink_cluster.txt",header = T,sep = "\t")
hap <- read.table("test.txt",header = F,sep = ":")
hap[,1] <- gsub("\\[","",hap[,1])
hap[,2] <- gsub("\\]","",hap[,2])
clut_hap <- clust[,2:3]
tem <- as.data.frame(strsplit(hap[,2]," "))[2,]
hap[,3] <-t(tem)
hap[,3] <- gsub("\\.1","",hap[,3])
hap[,3] <- gsub("\\.2","",hap[,3])
for (i in 1:nrow(clust)) {
flag <- 0
v1 <- 0
v2 <- 0
for (j in 1:nrow(hap)) {
if (flag==2) break
if(clust[i,1] %in% strsplit(hap[j,3]," ")[[1]]){
idx <- which(strsplit(hap[j,3]," ")[[1]] %in% clust[i,1])
if(length(idx)==1&flag==0) {v1 <- hap[j,1];flag <- flag+1;next}
if(length(idx)==1&flag==1) {v2 <- hap[j,1];flag <- flag+1;next}
if(length(idx)==2&flag==0) {v1 <- hap[j,1];v2 <- hap[j,1];next}
}
}
clut_hap[i,3] <- v1
clut_hap[i,4] <- v2
}
write.table(clut_hap,file=paste0(out_name,g,"_clut_hap.txt"),sep = "\t",append = FALSE, row.names = FALSE, col.names = F, quote = FALSE) #保存結(jié)果
names(clut_hap) <- c("ID","group"," "," ")
tem <- rbind(clut_hap[,c(3,2)],clut_hap[,c(4,2)])
tem <- tem[!is.na(tem[,2]),]
temp <- table(tem, exclude = 0)
out_matrix <- data.frame(matrix(temp,nrow = length(row.names(temp))))
row.names(out_matrix) <- row.names(temp)
names(out_matrix) <- gsub(" ","_",colnames(temp))
out_matrix$idx <- as.numeric(gsub("Hap_","",row.names(out_matrix)))
out_matrix <- out_matrix[order(out_matrix$idx),]
out_matrix <- out_matrix[,-ncol(out_matrix)]
cat("\n",file=paste0(out_name,g,".prenex"),append = FALSE)
cat("BEGIN TRAITS;\n",file=paste0(out_name,g,".prenex"),append = TRUE)
cat(paste0("Dimensions NTRAITS=",ncol(out_matrix),";\n"),file=paste0(out_name,g,".prenex"),append = TRUE)
cat(paste0("Format labels=yes missing=? separator=Tab",";\n"),file=paste0(out_name,g,".prenex"),append = TRUE)
cat(paste0("TraitLabels\t",paste(names(out_matrix),collapse = "\t"),";\n"),file=paste0(out_name,g,".prenex"),append = TRUE)
cat(paste0("Matrix","\n"),file=paste0(out_name,g,".prenex"),append = TRUE)
write.table(out_matrix,file=paste0(out_name,g,".prenex"),sep = "\t",append = TRUE,row.names = TRUE, col.names = F, quote = FALSE)
把生成的文件全部?jī)?nèi)容貼到dnasp生成nex文件底部悄窃。
但我在用popart的時(shí)后讥电,文件還是導(dǎo)不進(jìn)去。發(fā)現(xiàn)需要在nex文件刪掉一段轧抗。
然后就用popart可視化就好了恩敌。
這個(gè)太多了,手動(dòng)去了下横媚,只保留17個(gè)單倍型(修改TAXA,CHARACTERS,TRAITS三個(gè)字段)纠炮。
也可以把分類信息改成地理信息
我發(fā)現(xiàn)popart計(jì)算出的中性突變系數(shù)和Tassel是一樣的。
都有單倍型信息了也順便做下方差分析好了灯蝴。
這里用RFGB下載到的GL表型試一下恢口。
正常的關(guān)聯(lián)分析應(yīng)該用的是混合線性模型。
但這里只有一個(gè)基因上的單倍型信息穷躁,用整體的協(xié)方差或親緣關(guān)系矩陣會(huì)把模型變復(fù)雜耕肩,就用簡(jiǎn)單點(diǎn)的方差分析好了。(這方法有待商榷)
pheno <- read.table("grain_length.txt",header = F,sep = "\t") #表型文件
genotype <- read.table(paste0(out_name,g,"_clut_hap.txt"),header = F,sep = "\t") #單倍型文件
thr <- 50 #去掉樣本少的單倍型
tem <- merge(genotype[,c(1,4)],pheno,by=c("V1"))
colnames(tem) <- c("V1","V3","V2")
tem <- rbind(tem, merge(genotype[,c(1,3)],pheno,by=c("V1")))
tem <- merge(tem,genotype[,c(1,2)],by= c("V1"))
colnames(tem) <- c("ID","hap","pheno","cluster")
freq <- as.data.frame(table(tem$hap))
pass_filt <- as.character(freq[freq[,2]>=thr,1])
data <- tem[which(tem$hap %in% pass_filt),]
library(car)
library(agricolae)
mod <- aov(pheno~factor(hap),data) #方差分析
Anova(mod,type=3)
oneway.test(pheno~factor(hap),data,var.equal = F)#方差不齊的方差分析
out <- scheffe.test(mod,"factor(hap)") #可以用于非平衡數(shù)據(jù)的多重比較
out$groups
plot(out)
簡(jiǎn)單的方差分析可得hap_101的粒長(zhǎng)較長(zhǎng)。
做個(gè)總結(jié)
上述主要是提供了一個(gè)在所有7.1m的SNP數(shù)據(jù)集中看疗,提取部分位點(diǎn)的標(biāo)記沙峻,并計(jì)算LD和一些群體遺傳指標(biāo),基本上是R+plink就解決了两芳。
當(dāng)然如果手頭上有一些品種的重測(cè)序數(shù)據(jù)摔寨,也可以和3k水稻的SNP數(shù)據(jù)集放到一起進(jìn)行比對(duì),相信這個(gè)數(shù)據(jù)集還有很多信息沒有被挖掘怖辆,共勉是复。
R代碼匯總
提取位點(diǎn),群體指標(biāo)計(jì)算
#基本參數(shù)
gene_list <- "C:\\Users\\jinghai\\Desktop\\dailywork\\9-2\\FTIP_gene_len.txt" #目標(biāo)區(qū)間
out_name <- "C:\\Users\\jinghai\\Desktop\\dailywork\\9-2\\FTIP_gene" #輸出文件前綴
data_name <- c(".\\data\\base_filtered_v0.7",".\\data\\Nipponbare_indel") #SNP數(shù)據(jù)集,輸入多個(gè)就是合并
pop_name<-c("GJ","XI","admix","aus","bas","GJ XI","admix aus bas") #需要生成的群體
fst_pop <- c("GJ XI","admix aus bas") #計(jì)算fst需要的群體 需在pop_name出現(xiàn)過(guò)
#質(zhì)控參數(shù)
maf <- 0.05
miss <- 0.4
color.rgb <- colorRampPalette(rev(c("white","red")),space="rgb") #color for LD
target_gene <- read.table(gene_list,header = T)
extract <- cbind(target_gene $CHROM,target_gene $start,target_gene $end,target_gene $GENE)
write.table(extract,file="temp_extract.txt",sep = "\t",append = FALSE, row.names = FALSE, col.names = FALSE, quote = FALSE)
###
#提取目標(biāo)區(qū)域位點(diǎn) 并輸出vcf文件 可以用Tassel 看LD
#查看樣本是否缺少
tem <- c()
for (i in 1:length(data_name)) {
temp <- read.table(paste0(data_name[i],".fam"),header = F)
temp <- nrow(temp)
tem <- c(tem,temp)
}
#用少了樣本的數(shù)據(jù)集來(lái)keep
keep_file <- data_name[which(tem %in% min(tem))]
for (i in 1:length(data_name)) {
total_cmd <- paste0(".//plink --bfile ",data_name[i] ," --extract temp_extract.txt --range --keep ",paste0(keep_file,".fam")," --recode vcf-iid --out ",paste0(out_name,"_",i))
system(total_cmd,intern = FALSE)
}
temp_all <- read.table(paste0(out_name,"_",1,".vcf"),header = F,sep = '\t')
if (length(data_name)>=2){ #合并數(shù)據(jù)集
for (i in 2:length(data_name)) {
temp <- read.table(paste0(out_name,"_",i,".vcf"),header = F,sep = '\t')
temp_all<-rbind(temp_all ,temp )
}
temp_all<-temp_all[order(temp_all[,1],temp_all[,2]),] #按照染色體和POS排序
}
### 列名
fam_name<-read.table(paste0(keep_file,".fam") ,header = F,sep = ' ')[,1]
vcf_colnames<-c("#CHROM", "POS","ID","REF","ALT","QUAL","FILTER","INFO","FORMAT")
vcf_colnames<-c(vcf_colnames,fam_name)
###輸出結(jié)果竖螃,vcf
names(temp_all)<-vcf_colnames
write.table(temp_all,file=paste0(out_name,"_","target_all.vcf"),sep = "\t",append = FALSE, row.names = FALSE, col.names = T, quote = FALSE)
###篩選
total_cmd <- paste0("plink --vcf ",paste0(out_name,"_","target_all.vcf")," --const-fid --maf ",maf," --geno ",miss," --make-bed --out ",paste0(out_name,"_","target_all")) #直接用vcf篩選會(huì)有點(diǎn)問題淑廊,還是再轉(zhuǎn)下bed格式
system(total_cmd,intern = FALSE)
temp <- read.table(paste0(out_name,"_","target_all",".fam"),header = F)
temp[,1] <- temp[,2]
write.table(temp,file=paste0(out_name,"_","target_all",".fam"),sep = " ",append = FALSE, row.names = FALSE, col.names = F, quote = FALSE)
total_cmd <- paste0(".//plink --bfile ",paste0(out_name,"_","target_all") ," --recode vcf-iid --out ",paste0(out_name,"_","target_all"))
system(total_cmd,intern = FALSE)
#不同群體提取 按基因提取
for(g in target_gene$GENE){
extract <-dplyr::filter(target_gene ,GENE==g)
extract <- cbind(extract$CHROM,extract$start,extract$end,extract$GENE)
write.table(extract,file="temp_extract.txt",sep = "\t",append = FALSE, row.names = FALSE, col.names = FALSE, quote = FALSE)
total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all")," --extract temp_extract.txt --range --recode vcf-iid --out ",paste0(out_name,"_","target_all_",g))
system(total_cmd,intern = FALSE)
total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all")," --extract temp_extract.txt --range "," --recode --make-bed --out ",paste0(out_name,"_","target_all_",g))
system(total_cmd,intern = FALSE)
total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all")," --extract temp_extract.txt --range --recode HV --snps-only just-acgt --out ",paste0(out_name,"_","target_all_",g))
system(total_cmd,intern = FALSE)
for (p in pop_name) {
total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all_",g)," --within plink_cluster.txt --keep-cluster-names ",p," --recode vcf-iid --out ",paste0(out_name,"_","target_all_",g,"_",gsub(" ","_",p)))
system(total_cmd,intern = FALSE)
total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all_",g)," --within plink_cluster.txt --keep-cluster-names ",p," --recode HV --snps-only just-acgt --out ",paste0(out_name,"_","target_all_",g,"_",gsub(" ","_",p)))
system(total_cmd,intern = FALSE)
### total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all_",g)," --within plink_cluster.txt --keep-cluster-names ",p," --recode --make-bed --out ",paste0(out_name,"_","target_all_",g,"_",gsub(" ","_",p)))
### system(total_cmd,intern = FALSE)
}
}
for (p in pop_name) {
total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all")," --within plink_cluster.txt --keep-cluster-names ",p," --make-bed --out ",paste0(out_name,"_","target_all_",gsub(" ","_",p)))
system(total_cmd,intern = FALSE)
}
###
library(LDheatmap)
library(genetics)
plot_LD <- function(vcf_file_name,title){
temp_vcf <- as.data.frame(data.table::fread(vcf_file_name,header = T,sep = '\t'))
if(nrow(temp_vcf )<=1) {print(paste0(vcf_file_name," have no snp&indel"));return()}###沒有就結(jié)束
temp_snp <- temp_vcf[0,]
for (i in 1:nrow(temp_vcf)) {
temp <- temp_vcf[i,]
for(j in 10:ncol(temp_vcf)){
if(temp_vcf[i,j]=="0/0"){
temp[1,j] <- paste0(temp[1,4],"/",temp[1,4])
next
}
if(temp_vcf[i,j]=="0/1"){
temp[1,j] <- paste0(temp[1,4],"/",temp[1,5])
next
}
if(temp_vcf[i,j]=="1/1"){
temp[1,j] <- paste0(temp[1,5],"/",temp[1,5])
next
}
if(temp_vcf[i,j]=="./."){
temp[1,j] <- NA
next
}
}
temp_snp <- rbind(temp_snp,temp)
}
SNPpos <- temp_snp$POS
SNP <- as.data.frame(t(temp_snp[,10:ncol(temp_snp)]))
### 轉(zhuǎn)換成LDheatmap能用的格式
for(i in 1:ncol(SNP)){
SNP[,i]<-as.genotype(SNP[,i])
}
###畫圖
LD <- LDheatmap(SNP, SNPpos,flip=TRUE,color=color.rgb(20),title = title,SNP.name=NULL)
}
for(g in target_gene$GENE){
plot_LD(paste0(out_name,"_","target_all_",g,".vcf"),g)
for (p in pop_name) {
plot_LD(paste0(out_name,"_","target_all_",g,"_",gsub(" ","_",p),".vcf"),paste0(g,"_",gsub(" ","_",p)))
}
}
###
total_pop<-length(fst_pop)
total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all")," --fst --within plink_cluster.txt"," --out ",paste0(out_name,"_","target_all"))
system(total_cmd,intern = FALSE)
###
for(p in fst_pop){
total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all_",gsub(" ","_",p))," --fst --within plink_cluster.txt --keep-cluster-names ",p," --out ",paste0(out_name,"_","target_all_",gsub(" ","_",p)))
system(total_cmd,intern = FALSE)
}
temp_table<-read.table(paste0(out_name,"_","target_all",".fst"),header=T)
fst <- as.data.frame(matrix(nr=nrow(temp_table),nc=total_pop+1))
fst[,1] <- temp_table$FST
for (i in 1:total_pop) {
temp <- read.table(paste0(out_name,"_","target_all_",gsub(" ","_",fst_pop[i]),".fst"),header=T)
fst[,i+1] <- temp$FST
}
temp_all <- read.table(paste0(out_name,"_","target_all",".vcf"),header = F,sep="\t")
temp <- cbind(CHR=temp_all$V1,POS=temp_all$V2,fst)
names(temp) <- c("CHR","POS","fst",paste("fst",gsub(" ","_",fst_pop),sep = "_"))
write.table(temp,file=paste0(out_name,"_","target_all",".fst.txt"),sep = "\t",append = FALSE, row.names = FALSE, col.names = T, quote = FALSE) #保存結(jié)果
###
total_pop<-length(pop_name)
total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all")," --freq --out ",paste0(out_name,"_","target_all"))
system(total_cmd,intern = FALSE)
###
for(p in pop_name){
total_cmd <- paste0(".\\plink --bfile ",paste0(out_name,"_","target_all_",gsub(" ","_",p))," --freq --out ",paste0(out_name,"_","target_all_",gsub(" ","_",p)))
system(total_cmd,intern = FALSE)
}
temp_table<-read.table(paste0(out_name,"_","target_all",".frq"),header=T)
pi <- as.data.frame(matrix(nr=nrow(temp_table),nc=total_pop+1))
pi[,1] <- 2*temp_table$MAF*(1-temp_table$MAF)*temp_table$NCHROBS/(temp_table$NCHROBS-1)
for (j in 1:total_pop) {
temp <- read.table(paste0(out_name,"_","target_all_",gsub(" ","_",pop_name[j]),".frq"),header=T)
pi[,j+1] <- 2*temp$MAF*(1-temp$MAF)*temp$NCHROBS/(temp$NCHROBS-1)
}
temp_all <- read.table(paste0(out_name,"_","target_all",".vcf"),header = F,sep="\t")
temp <- cbind(CHR=temp_all$V1,POS=temp_all$V2,pi)
names(temp) <- c("CHR","POS","pi",paste("pi",gsub(" ","_",pop_name),sep = "_"))
write.table(temp,file=paste0(out_name,"_","target_all",".pi"),sep = "\t",append = FALSE, row.names = FALSE, col.names = T, quote = FALSE) #保存結(jié)果
###
vcf to fasta
phase_file <- "D:\\beagle\\phased.vcf.vcf"
g <- g
vcf <- data.table::fread(phase_file,header = T)
temp <- c()
for (i in 1:nrow(vcf)) {
temp_value <- vcf[i,10:ncol(vcf)]
ref <- vcf[i,4]
alt <- vcf[i,5]
if(nchar(alt)>nchar(ref)){
alt <- substr(alt,2,nchar(alt))
ref <- paste(rep("-",nchar(alt)),collapse = "")
}
if(nchar(ref)>nchar(alt)){
ref <- substr(ref,2,nchar(ref))
alt <- paste(rep("-",nchar(ref)),collapse = "")
}
temp_value[,which(temp_value=="0|0")] <- paste(ref,ref)
temp_value[,which(temp_value=="0|1")] <- paste(ref,alt)
temp_value[,which(temp_value=="1|0")] <- paste(alt,ref)
temp_value[,which(temp_value=="1|1")] <- paste(alt,alt)
temp <- rbind(temp,temp_value)
}
temp <- as.data.frame(temp)
cat("",file=paste0(out_name,g,".fasta"),append = FALSE)
for (i in 1:ncol(temp)) {
c1 <- c()
c2 <- c()
for (j in 1:nrow(temp)) {
c1 <- c(c1,strsplit(temp[j,i]," ")[[1]][1])
c2 <- c(c2,strsplit(temp[j,i]," ")[[1]][2])
}
cat(paste(">",paste0(names(temp)[i],".1"),"\n"),file=paste0(out_name,g,".fasta"),append = TRUE)
cat(paste0(paste(c1,collapse=""),"\n"),file=paste0(out_name,g,".fasta"),append = TRUE)
cat(paste(">",paste0(names(temp)[i],".2"),"\n"),file=paste0(out_name,g,".fasta"),append = TRUE)
cat(paste0(paste(c2,collapse=""),"\n"),file=paste0(out_name,g,".fasta"),append = TRUE)
}
nex traits 文件生成
clust <- read.table("plink_cluster.txt",header = T,sep = "\t")
hap <- read.table("test.txt",header = F,sep = ":")
hap[,1] <- gsub("\\[","",hap[,1])
hap[,2] <- gsub("\\]","",hap[,2])
clut_hap <- clust[,2:3]
tem <- as.data.frame(strsplit(hap[,2]," "))[2,]
hap[,3] <-t(tem)
hap[,3] <- gsub("\\.1","",hap[,3])
hap[,3] <- gsub("\\.2","",hap[,3])
for (i in 1:nrow(clust)) {
flag <- 0
v1 <- 0
v2 <- 0
for (j in 1:nrow(hap)) {
if (flag==2) break
if(clust[i,1] %in% strsplit(hap[j,3]," ")[[1]]){
idx <- which(strsplit(hap[j,3]," ")[[1]] %in% clust[i,1])
if(length(idx)==1&flag==0) {v1 <- hap[j,1];flag <- flag+1;next}
if(length(idx)==1&flag==1) {v2 <- hap[j,1];flag <- flag+1;next}
if(length(idx)==2&flag==0) {v1 <- hap[j,1];v2 <- hap[j,1];next}
}
}
clut_hap[i,3] <- v1
clut_hap[i,4] <- v2
}
write.table(clut_hap,file=paste0(out_name,g,"_clut_hap.txt"),sep = "\t",append = FALSE, row.names = FALSE, col.names = F, quote = FALSE) #保存結(jié)果
names(clut_hap) <- c("ID","group"," "," ")
tem <- rbind(clut_hap[,c(3,2)],clut_hap[,c(4,2)])
tem <- tem[!is.na(tem[,2]),]
temp <- table(tem, exclude = 0)
out_matrix <- data.frame(matrix(temp,nrow = length(row.names(temp))))
row.names(out_matrix) <- row.names(temp)
names(out_matrix) <- gsub(" ","_",colnames(temp))
out_matrix$idx <- as.numeric(gsub("Hap_","",row.names(out_matrix)))
out_matrix <- out_matrix[order(out_matrix$idx),]
out_matrix <- out_matrix[,-ncol(out_matrix)]
cat("\n",file=paste0(out_name,g,".prenex"),append = FALSE)
cat("BEGIN TRAITS;\n",file=paste0(out_name,g,".prenex"),append = TRUE)
cat(paste0("Dimensions NTRAITS=",ncol(out_matrix),";\n"),file=paste0(out_name,g,".prenex"),append = TRUE)
cat(paste0("Format labels=yes missing=? separator=Tab",";\n"),file=paste0(out_name,g,".prenex"),append = TRUE)
cat(paste0("TraitLabels\t",paste(names(out_matrix),collapse = "\t"),";\n"),file=paste0(out_name,g,".prenex"),append = TRUE)
cat(paste0("Matrix","\n"),file=paste0(out_name,g,".prenex"),append = TRUE)
write.table(out_matrix,file=paste0(out_name,g,".prenex"),sep = "\t",append = TRUE,row.names = TRUE, col.names = F, quote = FALSE)
方差分析
pheno <- read.table("grain_length.txt",header = F,sep = "\t")
genotype <- read.table(paste0(out_name,g,"_clut_hap.txt"),header = F,sep = "\t")
thr <- 50
tem <- merge(genotype[,c(1,4)],pheno,by=c("V1"))
colnames(tem) <- c("V1","V3","V2")
tem <- rbind(tem, merge(genotype[,c(1,3)],pheno,by=c("V1")))
tem <- merge(tem,genotype[,c(1,2)],by= c("V1"))
colnames(tem) <- c("ID","hap","pheno","cluster")
freq <- as.data.frame(table(tem$hap))
pass_filt <- as.character(freq[freq[,2]>=thr,1])
data <- tem[which(tem$hap %in% pass_filt),]
library(car)
library(agricolae)
mod <- aov(pheno~factor(hap),data)
Anova(mod,type=3)
oneway.test(pheno~factor(hap),data,var.equal = F)
out <- scheffe.test(mod,"factor(hap)") #可以用于非平衡數(shù)據(jù)的多重比較
out$groups
plot(out)
吐槽一下,簡(jiǎn)書的這個(gè)界面太不好寫了特咆,也不好看季惩。代碼中難免會(huì)有多個(gè)空格,逗號(hào)不對(duì)啥的腻格,請(qǐng)見諒画拾。