GWAS imputation是什么?
- Genotype imputation 是運(yùn)用連鎖不平衡的原理依據(jù)一個高密度的參考基因組填補(bǔ)要研究數(shù)據(jù)的一種方法。
- 常用的參考基因組數(shù)據(jù)庫包括The HapMap Consortium database (International HapMap 3 Consortium, 2010)老速、the 1000 Genomes Project (1KG; 1000 Genomes Project Consortium,2012)。千人基因組計劃涵蓋了世界各地的1,092個人的全基因組信息 (其中181 samples from Admixed American, 246 from African, 286 from East Asian, and 379 from European ancestry groups)凸主。
原理如下圖:
為什么要做GWAS imputation橘券?
常用的GWAS芯片大約60萬個位點(diǎn),經(jīng)過質(zhì)控后大約只剩下30多萬個位點(diǎn)卿吐,對于全基因組30億個堿基來說旁舰,只覆蓋了全基因組萬分之一的區(qū)域,因此大片的區(qū)域為空白嗡官。經(jīng)過基因型填補(bǔ)后箭窜,SNP密度大大增加,如果與表型相關(guān)聯(lián)的位置沒有SNP衍腥,填補(bǔ)前是沒有顯著性的磺樱,填補(bǔ)后則有可能出現(xiàn)顯著性。因此紧阔,基因型填補(bǔ)可以大大增加GWAS的統(tǒng)計效能坊罢。
GWAS imputation 常用軟件
- Minimac, IMPUTE2,PLINK, BEAGLE, MaCH+minimac, fastPHASE
GWAS imputation 步驟
1擅耽、在線資源
- 很多在線的GWAS imputation服務(wù)網(wǎng)站活孩,上傳數(shù)據(jù)即可,如https://imputationserver.sph.umich.edu/index.html#!pages/help
https://imputation.sanger.ac.uk/
2乖仇、本地運(yùn)行
imputation一般步驟:
- Liftover PED/MAP files from build 36 to build 37:由于基因組版本的問題憾儒,數(shù)據(jù)中SNP位點(diǎn)的位置在不同的版本中可能有差異,因此要轉(zhuǎn)換為統(tǒng)一的版本以便后續(xù)比對乃沙。一般被稱為pre-processing過程起趾。
具體包括:
- Create a ?bed file based on MAP file.
- Map the bed file to build 37 using ?UCSVC liftover tool.
- Create list of unmapped SNPs.
- Create plink mappings file.
- Create new plink data without unmapped SNPs using ?Plink.
- Quality control of study data:對原始數(shù)據(jù)進(jìn)行質(zhì)控(Quality control)提高填補(bǔ)的準(zhǔn)確性
具體包括:
- 將要填補(bǔ)的數(shù)據(jù)與參考基因組比對,看SNP在正鏈還是負(fù)鏈上警儒,過濾掉翻轉(zhuǎn)的SNP
- 要求SNP符合哈迪溫伯格平衡>10-4, MAF>0.01, call rate>0.95训裆。不符合條件的SNP被過濾掉。
- 檢查SNP是否在參考基因組中蜀铲,有的參考基因組中沒有這個SNP边琉,也會被過濾掉
- 檢查位點(diǎn)在要填補(bǔ)的數(shù)據(jù)和參考基因組中的頻率是否差異過大,如果差異超過25%該SNP將會被過濾掉
- 比較數(shù)據(jù)和參考基因組之間的haplotype結(jié)構(gòu)差異(r-squared > 0.1的SNP)记劝,差異過大的SNP被過濾掉变姨。
- Phasing the study data
具體包括:
- Gather the map files as provided in the ALL_1000G_phase1integrated_v3_impute.tgz from the impute website.
- Phase the study data using the map files.
- Imputing study data
- 將數(shù)據(jù)按染色體分開,用 Impute2軟件填補(bǔ)經(jīng)過phased的數(shù)據(jù), 一般填補(bǔ)最小分辨率為5 million base厌丑。
- 將數(shù)據(jù)合并到一起
流程:
由于以上步驟很多定欧,為了方便imputation過程渔呵,很多人開發(fā)了自動化的pipeline,如
- https://github.com/CNSGenomics/impute-pipe
- https://github.com/pgxcentre/genipe
- http://www.arrayserver.com/wiki/index.php?title=Genetics_GwasImputePipeline.pdf
- https://github.com/molgenis/molgenis-pipelines
step1: QC
##QC1: Remove SNPs with missing genotype rate >5%
${plink} --bfile ${outpath4step}/${name} --missing --out ${outpath4step}/${name}.plink
awk 'NR < 2 { next } $5 > 0.05' ${outpath4step}/${name}.plink.lmiss > ${outpath4step}/${name}.plink.lmiss.exclude
${plink} --bfile ${outpath4step}/${name} \
--exclude ${outpath4step}/${name}.plink.lmiss.exclude \
--make-bed --out ${outpath4step}/${name}.plink_QC1
##QC2 :Remove SNPs deviated from HWE (10e-4)
${plink} --bfile ${outpath4step}/${name}.plink_QC1 \
--hardy --out ${outpath4step}/${name}.plink_QC1
awk '$3=="UNAFF" && $9 <0.0001 {print $2}' ${outpath4step}/${name}.plink_QC1.hwe \
> ${outpath4step}/${name}.plink_QC1.hwe.exclude
${plink} --bfile ${outpath4step}/${name}.plink_QC1 \
--exclude ${outpath4step}/${name}.plink_QC1.hwe.exclude \
--make-bed --out ${outpath4step}/${name}.plink_QC2
##QC3 :Remove MAF>0.01
${plink} --bfile ${outpath4step}/${name}.plink_QC2 \
--maf 0.01 \
--make-bed \
--out ${outpath4step}/${name}.plink_QC3
Step2: Alignment of the SNPs
# make sure that the GWAS dataset is well aligned with the reference panel of haplotypes
# if not, use the UCSC liftOver tool to perform the conversion to build37 coordinates
# download 1000 genome refernce file
for i in {1..22}
do
nohup wget -c -q http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr$i.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz &
nohup wget -c -q http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr$i.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz.tbi &
done
# Combine all chromosomes in a single file
bcftools concat ALL.chr{1..22}.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz \
-Oz -o ALL.autosomes.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz --threads 48
# Preparing a VCF file
bgzip -c example.vcf > example.vcf.gz
tabix -p vcf example.vcf.gz
# convert VCF to PLINK
# To build PLINK compatible files from the VCF files, duplicate positions and SNP id need to be merged or removed.
# remove all duplicate entries.
# Catalogue duplicate SNP id:
grep -v '^#' <(zcat ALL.autosomes.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz) | cut -f 3 | sort | uniq -d > ALL.autosomes.dups
# Using BCFTools, split multi-allelic SNPs, and using plink remove duplicate SNPs id found in previous step:
bcftools norm -d both -m +any -Ob ALL.autosomes.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz --threads 64 \
| plink --bcf /dev/stdin --make-bed --out ALL.autosomes --allow-extra-chr 0 --memory 60000 --exclude ALL.autosomes.dups
java -jar GenotypeHarmonizer.jar \
--inputType PLINK_BED \
--input /hapmap3CeuChr20B37Mb6RandomStrand \
--update-id \
--outputType PLINK_BED \
--output /all_chrs \
--refType VCF \
--ref /ALL.autosomes.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
step3.pre-phasing using SHAPEIT2
shapeit -B all_chrs \
-M /home/zhouwei/zhou_data/GWAS/imputation/impute-pipe/data/reference/ALL_1000G_phase1integrated_v3_impute/genetic_map_chr20_combined_b37.txt \
-O phased_all_chrs -T 64
Step 4: Imputation into pre-phased haplotypes
impute2 -use_prephased_g -Ne 20000 -iter 30 -align_by_maf_g -os 0 1 2 3 -seed 1000000 \
-o_gz -int 1 5e6 \
-h /ALL_1000G_phase1integrated_v3_chr20_impute.hap.gz \
-l /ALL_1000G_phase1integrated_v3_chr20_impute.legend.gz \
-m /genetic_map_chr20_combined_b37.txt \
-known_haps_g phased_all_chrs.haps \
-o /chr20.chunk1
impute2 \
-use_prephased_g -Ne 20000 -iter 30 -align_by_maf_g -os 0 1 2 3 -seed 1000000 -o_gz -int 5e6 10e6 \
-h /ALL_1000G_phase1integrated_v3_chr20_impute.hap.gz \
-l /ALL_1000G_phase1integrated_v3_chr20_impute.legend.gz \
-m /genetic_map_chr20_combined_b37.txt \
-known_haps_g phased_all_chrs.haps \
-o chr20.chunk2
Step 5: Combine all the chunks
cat chr22.chunk1.gz chr22.chunk2.gz > chr22_chunkAll.gen.gz
待完善
參考:
http://www.bbmriwiki.nl/wiki/Impute2Pipeline
http://databeauty.com/blog/tutorial/2017/02/20/GWAS-prephasing-and-imputation.html