全基因組關(guān)聯(lián)分析是基于數(shù)量性狀連鎖不平衡理論胜榔,通過(guò)表型與基因型關(guān)聯(lián)分析獲得影響表型性狀的顯著關(guān)聯(lián)標(biāo)記蜕径。
1. 軟件
Tassel / Gaipt / FarmCPU / mrMLM / PLINK / Structure / SPAGeDi
以上軟件都是在 GWAS 分析中可能用到的軟件耕魄。
2. 數(shù)據(jù)
Genotype ( hmp || vcf format file ) + Penotype ( Traits )
關(guān)于表型的處理請(qǐng)見(jiàn)前一篇文章添谊。
3. 分析流程
從原始數(shù)據(jù)的獲得到 Call SNP 的流程請(qǐng)參考之前文章痪伦。
本文的分析從 hmp/vcf 格式的基因型文件開(kāi)始荣回。
3.1 基因型過(guò)濾
過(guò)濾標(biāo)準(zhǔn)如下:
1. Filter SNP Call Rate less than 90 % (max missing rate is 0.1)
2. Filter mult-allelic sites (include only bi-allelic sites)
3. Filter SNP heterozygosity rate more than 0.2
4. Hardy-Weinberg Equilibrium using an exact test pvalue less than 0.01
5. Minor Allele Frequency (MAF) <0.05
6. Filter Sample Call Rate less than 90 % (max sample call missing rate is 0.1)
7. Missing Rate < 0.05 (For Structure) **
8. Genetic Diversity (GD > 0.45, For Structure) **
具體操作流程如下:
## 1. 使用 Tassel 5 進(jìn)行轉(zhuǎn)換遭贸,可以用 GUI 操作,命令行操作如下:
# Step 1: sort hmp foramt file
/opt/bio/tassel-5-standalone/run_pipeline.pl -SortGenotypeFilePlugin -inputFile my_genotype.hmp.txt -outputFile my_genotype.sort.hmp.txt -fileType Hapmap
# Step 2 : convert hmp to vcf
/opt/bio/tassel-5-standalone/run_pipeline.pl -fork1 -h my_genotype.sort.hmp.txt -export -exportType VCF
## 切記驹马,需要把基因型中 -- 轉(zhuǎn)換為 NN
## 切記革砸,需要把基因型中 -- 轉(zhuǎn)換為 NN
## 切記除秀,需要把基因型中 -- 轉(zhuǎn)換為 NN
## 2. 使用 Plink 軟件對(duì)進(jìn)行基因型過(guò)濾
E蠢郏 Step3: filter genotype
/opt/bio/plink-1.9/plink --vcf LiZL_Genotype_Sort.vcf --maf 0.05 --geno 0.05 --mind 0.7 --recode 'vcf' --biallelic-only strict --out LiZL_Genotype_Filter
## Step 3: Convert to normal format
/opt/bio/tassel-5-standalone/run_pipeline.pl -fork1 -vcf ZXX_Genotype_Filter.vcf -export ZXX_Genotype_Filter.hmp.txt -exportType HapmapDiploid
3.2 群體結(jié)構(gòu)
SNP 標(biāo)記過(guò)濾完成后,使用程序在各條染色體上隨機(jī)均勻提取一定數(shù)目的標(biāo)記用來(lái)計(jì)算 Structure 群體結(jié)構(gòu)册踩,之所以不使用全部的標(biāo)記是因?yàn)橛窘悖繕?biāo)記間連鎖作用明顯并且計(jì)算量龐大。
一般使用 5000 個(gè)標(biāo)記進(jìn)行群體分層估算暂吉。同時(shí)群體所需要的標(biāo)記更嚴(yán)格一些胖秒,過(guò)濾標(biāo)準(zhǔn)如下:
1. missing rate < 0.05,
2. Genetic diversity ( GD > 0.45 )
使用 plink 軟件進(jìn)行 Structure 數(shù)據(jù)格式轉(zhuǎn)化:
## Step1 sort hapmap
/opt/bio/tassel-5-standalone/run_pipeline.pl -SortGenotypeFilePlugin -inputFile ForStructure.txt -outputFile For_Q.sort.hmp.txt -fileType Hapmap
## Step2 Convert hapmap
/opt/bio/tassel-5-standalone/run_pipeline.pl -fork1 -h For_Q.sort.hmp.txt -export For_Q -exportType VCF
## Step3 Make for Q matrix
/opt/bio/plink-1.9/plink --vcf For_Q.vcf --recode structure --out Structure
數(shù)據(jù)準(zhǔn)備好之后,使用經(jīng)典的 Structure 軟件進(jìn)行群體 K 值估算慕的。
Structure 軟件運(yùn)行參數(shù)設(shè)置(Command-line version):
群體結(jié)構(gòu)主要有兩個(gè)參數(shù)文件阎肝, mainparams 和 extraparams 文件,具體設(shè)置如下:
每個(gè)參數(shù)均以 #define 開(kāi)頭:
KEY PARAMETERS FOR THE PROGRAM structure. YOU WILL NEED TO SET THESE IN ORDER TO RUN THE PROGRAM. VARIOUS OPTIONS CAN BE ADJUSTED IN THE FILE extraparams.
基本參數(shù)配置 ( mainparams ):
Basic Program Parameters
#define MAXPOPS 2 // (int) 預(yù)設(shè)的亞群數(shù)
#define BURNIN 10000 // (int) length of burnin period
#define NUMREPS 20000 // (int) number of MCMC reps after burnin
Input/Output files
#define INFILE infile // (str) 輸入文件
#define OUTFILE outfile //(str) 輸出的結(jié)果文件
Data file format
#define NUMINDS 100 // (int) 材料數(shù)目
#define NUMLOCI 100 // (int) 標(biāo)記數(shù)目
#define PLOIDY 2 // (int) 幾倍體
#define MISSING -9 // (int) 基因型缺失值
#define ONEROWPERIND 0 // (B) 數(shù)據(jù)格式(單行/雙行)
#define LABEL 1 // (B) 是否包含材料名稱
#define POPDATA 1 // (B) Input file contains a population identifier
#define POPFLAG 0 // (B) Input file contains a flag which says whether to use popinfo when USEPOPINFO==1
#define LOCDATA 0 // (B) Input file contains a location identifier
#define PHENOTYPE 0 // (B) Input file contains phenotype information
#define EXTRACOLS 0 // (int) Number of additional columns of data before the genotype data start.
#define MARKERNAMES 1 // (B) 是否包含標(biāo)記名稱(首行)
#define RECESSIVEALLELES 0 // (B) data file contains dominant markers (eg AFLPs)
// and a row to indicate which alleles are recessive
#define MAPDISTANCES 1 // (B) data file contains row of map distances
// between loci“菇帧(plink 軟件出來(lái)格式有該行 )
Advanced data file options
#define PHASED 0 // (B) Data are in correct phase (relevant for linkage model only)
#define PHASEINFO 0 // (B) the data for each individual contains a line indicating phase (linkage model)
#define MARKOVPHASE 0 // (B) the phase info follows a Markov model.
#define NOTAMBIGUOUS -999 // (int) for use in some analyses of polyploid data
控制參數(shù)配置 ( extraparams ):
EXTRA PARAMS FOR THE PROGRAM structure. THESE PARAMETERS CONTROL HOW THE PROGRAM RUNS. ATTRIBUTES OF THE DATAFILE AS WELL AS K AND RUNLENGTH ARE SPECIFIED IN mainparams.
PROGRAM OPTIONS
#define NOADMIX 0 // (B) Use no admixture model (0=admixture model, 1=no-admix)
#define LINKAGE 0 // (B) Use the linkage model model
#define USEPOPINFO 0 // (B) Use prior population information to pre-assign individuals to clusters
#define LOCPRIOR 0 //(B) Use location information to improve weak data
#define FREQSCORR 1 // (B) allele frequencies are correlated among pops
#define ONEFST 0 // (B) assume same value of Fst for all subpopulations.
#define INFERALPHA 1 // (B) Infer ALPHA (the admixture parameter)
#define POPALPHAS 0 // (B) Individual alpha for each population
#define ALPHA 1.0 // (d) Dirichlet parameter for degree of admixture (this is the initial value if INFERALPHA==1).
#define INFERLAMBDA 0 // (B) Infer LAMBDA (the allele frequencies parameter)
#define POPSPECIFICLAMBDA 0 //(B) infer a separate lambda for each pop (only if INFERLAMBDA=1).
#define LAMBDA 1.0 // (d) Dirichlet parameter for allele frequencies
PRIORS
#define FPRIORMEAN 0.01 // (d) Prior mean and SD of Fst for pops.
#define FPRIORSD 0.05 // (d) The prior is a Gamma distribution with these parameters
#define UNIFPRIORALPHA 1 // (B) use a uniform prior for alpha; otherwise gamma prior
#define ALPHAMAX 10.0 // (d) max value of alpha if uniform prior
#define ALPHAPRIORA 1.0 // (only if UNIFPRIORALPHA==0): alpha has a gamma prior with mean A*B, and
#define ALPHAPRIORB 2.0 // variance A*B^2.
#define LOG10RMIN -4.0 //(d) Log10 of minimum allowed value of r under linkage model
#define LOG10RMAX 1.0 //(d) Log10 of maximum allowed value of r
#define LOG10RPROPSD 0.1 //(d) standard deviation of log r in update
#define LOG10RSTART -2.0 //(d) initial value of log10 r
USING PRIOR POPULATION INFO (USEPOPINFO)
#define GENSBACK 2 //(int) For use when inferring whether an individual is an immigrant, or has an immigrant ancestor in the past GENSBACK generations. eg, if GENSBACK==2, it tests for immigrant ancestry back to grandparents.
#define MIGRPRIOR 0.01 //(d) prior prob that an individual is a migrant (used only when USEPOPINFO==1). This should be small, eg 0.01 or 0.1.
#define PFROMPOPFLAGONLY 0 // (B) only use individuals with POPFLAG=1 to update P.
LOCPRIOR MODEL FOR USING LOCATION INFORMATION
#define LOCISPOP 1 //(B) use POPDATA for location information
#define LOCPRIORINIT 1.0 //(d) initial value for r, the location prior
#define MAXLOCPRIOR 20.0 //(d) max allowed value for r
OUTPUT OPTIONS
#define PRINTNET 1 // (B) Print the "net nucleotide distance" to screen during the run
#define PRINTLAMBDA 1 // (B) Print current value(s) of lambda to screen
#define PRINTQSUM 1 // (B) Print summary of current population membership to screen
#define SITEBYSITE 0 // (B) whether or not to print site by site results. (Linkage model only) This is a large file!
#define PRINTQHAT 0 // (B) Q-hat printed to a separate file. Turn this on before using STRAT.
#define UPDATEFREQ 100 // (int) frequency of printing update on the screen. Set automatically if this is 0.
#define PRINTLIKES 0 // (B) print current likelihood to screen every rep
#define INTERMEDSAVE 0 // (int) number of saves to file during run
#define ECHODATA 1 // (B) Print some of data file to screen to check that the data entry is correct.
(NEXT 3 ARE FOR COLLECTING DISTRIBUTION OF Q:)
#define ANCESTDIST 0 // (B) collect data about the distribution of ancestry coefficients (Q) for each individual
#define NUMBOXES 1000 // (int) the distribution of Q values is stored as a histogram with this number of boxes.
#define ANCESTPINT 0.90 // (d) the size of the displayed probability interval on Q (values between 0.0--1.0)
MISCELLANEOUS
#define COMPUTEPROB 1 // (B) Estimate the probability of the Data under the model. This is used when choosing the best number of subpopulations.
#define ADMBURNIN 500 // (int) [only relevant for linkage model]: Initial period of burnin with admixture model (see Readme)
#define ALPHAPROPSD 0.025 // (d) SD of proposal for updating alpha
#define STARTATPOPINFO 0 // Use given populations as the initial condition for population origins. (Need POPDATA==1). It is assumed that the PopData in the input file are between 1 and k where k<=MAXPOPS.
#define RANDOMIZE 1 // (B) use new random seed for each run
#define SEED 2245 // (int) seed value for random number generator (must set RANDOMIZE=0)
#define METROFREQ 10 // (int) Frequency of using Metropolis step to update Q under admixture model (ie use the metr. move every i steps). If this is set to 0, it is never used.(Proposal for each q^(i) sampled from prior. The goal is to improve mixing for small alpha.)
#define REPORTHITRATE 0 // (B) report hit rate if using METROFREQ
參數(shù)配置中:
"(int)" means that this takes an integer value.
"(B)" means that this variable is Boolean ( 1 for True, and 0 for False)
"(str)" means that this is a string
Structure 命令行使用:
structure -m mainparams -e extraparams -K MAXPOPS -L NUMLOCI -N NUMINDS -i input file -o output file -D SEED
K 值計(jì)算
程序運(yùn)行結(jié)束后可以將輸出結(jié)果打包上傳到下面這個(gè)網(wǎng)站风题,進(jìn)行K值估算。
點(diǎn)擊進(jìn)入: StructureHarvester
3.3 親緣關(guān)系
GWAS 混合線性模型中,親緣關(guān)系作為協(xié)變量對(duì)結(jié)果有一定的影響沛硅,使用 Tassel / GAPIT / SPAGeDi 等軟件可以輕松計(jì)算親緣關(guān)系矩陣眼刃。
這里我們演示 SPAGeDi 1.4 軟件運(yùn)行過(guò)程。
軟件下載地址:SPAGeDi-1.4b
為了計(jì)算快速摇肌,我們隨機(jī)抽取在染色體上均勻分布的5000個(gè)標(biāo)記進(jìn)行 Kinship 計(jì)算擂红,保證了標(biāo)記的隨機(jī)及代表性操作方法如下:
perl Step3.Random_fetch.pl Genotype_Final.hmp.txt > For_K.hmp.txt
perl Step5.Format_kinship.pl For_K.hmp.txt > K.in
## 得到的輸入文件需要將 NN 轉(zhuǎn)換為 0
SPAGeDi 軟件得到 K 矩陣的后續(xù)處理:
1. 對(duì)角線加1
2. 小于0的系數(shù)用0替換
3. 所有系數(shù)加倍
在獲得群體結(jié)構(gòu) Q 矩陣和親緣關(guān)系 K 矩陣后,我們可以使用 Tassel / Gapit / FarmCPU / mrMLM 等軟件進(jìn)行分析围小,具體操作見(jiàn)下一篇昵骤。
3.4 關(guān)聯(lián)分析
在GWAS研究中,當(dāng)涉及質(zhì)量性狀時(shí)一般采用 Logistic 回歸模型進(jìn)行分析肯适,對(duì)于數(shù)量性狀的研究涉茧,主要采用線性回歸模型進(jìn)行關(guān)聯(lián)分析。
在 Logistic 回歸模型中疹娶,基因型是應(yīng)變量伴栓,群體結(jié)構(gòu)和表型是自變量;
而在線性回歸模型中雨饺,表型是應(yīng)變量钳垮,其他品種、性別额港、群體結(jié)構(gòu)和基因型數(shù)據(jù)則是自變量饺窿。
線性模型包括兩種:一般線性模型(general linear model,GLM)和混合線性模型(mixed linear model移斩,MLM)肚医。復(fù)雜數(shù)量性狀通常受到多種因素的共同影響,而混合模型中可以加入固定效應(yīng)和隨機(jī)效應(yīng)向瓷,因此肠套,以研究數(shù)量性狀的全基因組關(guān)聯(lián)分析方法常采用混合線性模型進(jìn)行分析。
在GWAS中猖任,群體分層(population stratification)和多重假設(shè)檢驗(yàn)(multiple testing adjusting)是引起研究結(jié)果分析誤差的重要原因你稚。處理這兩種誤差的一種可能的策略是采用基于家系的關(guān)聯(lián)研究,該方法可以避免群體分成對(duì)關(guān)聯(lián)分析結(jié)果的影響朱躺。所謂群體分層刁赖,是指群體內(nèi)存在等位基因頻率不同的亞群體。由于自然選擇长搀、遺傳漂變宇弛、群體分層等諸多因素都會(huì)影響到群體中的連鎖不平衡,因此源请,在進(jìn)行關(guān)聯(lián)分析時(shí), 一些非原因等位基因也可以同真實(shí) QTL 形成連鎖不平衡表現(xiàn)為與研究性狀關(guān)聯(lián)枪芒,從而導(dǎo)致偽關(guān)聯(lián)或假陽(yáng)性的出現(xiàn)轿钠。
## Step3.Random_fetch.pl
#!/usr/bin/perl -w
use strict;
use Data::Dumper;
open IN, $ARGV[0] || die $!;
my $head = <IN>;
print $head;
my $total;
my %genotype;
while(<IN>){
chomp;
my @a = split /\t/, $_;
$genotype{$a[2]}{$a[0]} = $_;
if ($a[2] ne "0"){
$total++;
}
}
close IN;
for my $chr (1..10){
my @snp = keys %{$genotype{$chr}};
my $max = scalar (@snp);
my $num = ($max / $total) * 5000;
$num = sprintf ("%d",$num);
my %random = &random($max, $num+1);
foreach my $k (keys %random){
my $id = $snp[$k];
my $out = $genotype{$chr}{$id};
print "$out\n";
}
}
sub random{
my ($max, $num) = @_;
my %hash;
while ((keys %hash) < $num) {
$hash{int(rand($max))} = 1;
}
return %hash;
}
__END__
## Step5.Format_kinship.pl
#!/usr/bin/perl -w
use strict;
use Data::Dumper;
open IN, $ARGV[0] || die $!;
chomp(my $head = <IN>);
my @samples = split /\t/, $head;
my (%hash, @snps);
while(<IN>){
chomp;
my $n = 1;
my @a = split /\t/, $_;
for (my $i=11;$i<@a;$i++){
$a[$i] =~ tr/ATCG/1234/;
my @tmp = split (//, $a[$i]);
my $allen = join ",",@tmp;
$hash{$samples[$i]}{$a[0]} = $allen;
}
push @snps, $a[0];
}
close IN;
print "307\t0\t0\t5005\t1\t2\n0\n";
print join ("\t", "Name", sort @snps)."\n";
foreach my $sample (sort keys %hash){
my @out;
foreach my $snp (sort keys %{$hash{$sample}}){
push @out,$hash{$sample}{$snp};
}
print join ("\t",$sample,@out)."\n";
}
__END__
#!/usr/bin/perl -w
use strict;
open IN, $ARGV[0] || die $!;
my $head = <IN>;
print $head;
while(<IN>){
chomp;
my @a = split /\t/, $_;
for (my $i=1; $i<@a; $i++){
if ($a[$i] le "0.0"){
$a[$i] = "0.0";
}else{
$a[$i] = 2 * $a[$i];
}
}
print join("\t", @a)."\n";
}
close IN;
END
縮略詞:
BSSS, Iowa Stiff Stalk Synthetic;
NSS, Non-Stiff Stalk;
PA, Group A germplasm derived from modern U.S. hybrids;
PB, Group B germplasm derived from modern U.S. hybrids;
SPT, Sipingtou;
SS, Stiff Stalk;
TS, Tropical or semitropical;
SSR, Simple sequence repeat (markers);
PCA, Principal component analysis;
PIC, Polymorphism information content;
RFLP, Restriction fragment length polymorphism;
SNPs, Single nucleotide polymorphisms;
CIMMYT, International Maize and Wheat Improvement Center;
DNA,Deoxyribonucleicacid;
GAPIT, Genomic Association and Prediction Integrated Tool;
GBS, Genotyping by sequencing;
GD, Genetic diversity;
LD, Linkage disequilibrium;
LnP(D), Log likelihood of data;
LRC, Luda Red Cob;
Mb, Megabase;
NJ, Neighborjoining;