連續(xù)純合片段(runs of homozygosity, ROH)是親代將同源相同的單倍型傳遞給子代而形成的野哭,基于ROH計(jì)算的基因組近交系數(shù)是個(gè)體真實(shí)的近交系數(shù),它比傳統(tǒng)系譜計(jì)算的期望近交系數(shù)更加準(zhǔn)確掀抹,而且不再依賴于系譜記錄的準(zhǔn)確性和完整性虐拓,能準(zhǔn)確地反映兩個(gè)配子間的關(guān)系心俗。
ROH片段的長短還可以推斷近交歷史傲武。長的ROH片段反映最近世代發(fā)生過近交,而短的ROH說明較遠(yuǎn)世代產(chǎn)生近交城榛,因?yàn)槭来鷶?shù)越短ROH片段被重組打斷的可能性就越小揪利。
計(jì)算連續(xù)純合片段 runs of homozygosity
芯片文件為例
1.轉(zhuǎn)成vcf
nohup /home/software/plink --allow-extra-chr --chr-set 26 --file project_96s --make-bed --out project_96s &
nohup /home/software/plink --allow-extra-chr --chr-set 26 --bfile project_96s --export vcf --out project_96s &
2.過濾缺失0.1,MAF0.05
只使用常染色體上的位點(diǎn)狠持,最大缺失率(max-missing)<10%即檢出率>90%的SNP位點(diǎn)疟位,最小等位基因頻率(MAF)>0.05(哈代溫伯格平衡檢驗(yàn)的P值大于10^-6 -hwe)。
nohup /home/software/plink --allow-extra-chr --chr-set 26 -bfile project_96s --geno 0.1 --maf 0.05 --make-bed --out project_96s-geno01-maf05 &
3.去多等位基因喘垂,INDEL
nohup /home/software/plink --allow-extra-chr --chr-set 26 --bfile project_96s-geno01-maf05 --export vcf --out project_96s-geno01-maf05 & (bed-vcf)
bgzip -c -f project_96s-geno01-maf05.vcf > project_96s-geno01-maf05.vcf.gz
nohup /home/software/bcftools/bcftools view -m 2 -M 2 --type "snps" project_96s-geno01-maf05.vcf.gz -O z -o project_96s-geno01-maf05-snps.vcf.gz &
4.去重復(fù)標(biāo)記甜刻,對vcf文件進(jìn)行去重并排序,加頭
# 去重
sort -k 1,2 -u project_96s-geno01-maf05-snps.vcf > project_96s-geno01-maf05-snps.sorted.uniq.vcf
# 排序
cat project_96s-geno01-maf05-snps.sorted.uniq.vcf |/home/software/vcftools/src/perl/vcf-sort > project_96s-geno01-maf05-snps.sorted.uniq.recode.vcf
# 提取前40行vcf文件
head -40 project_96s-geno01-maf05-snps.vcf > 40.vcf
# 加頭
cat 40.vcf project_96s-geno01-maf05-snps.sorted.uniq.recode.vcf > project_96s-geno01-maf05-snps.sorted.uniq-40tou.recode.vcf
5.(基因填充beagle)
6.去除0號(hào)染色體
# 刪除0正勒、27得院、30號(hào)染色體
nohup /home/software/vcftools/src/cpp/vcftools --vcf project_96s-geno01-maf05-snps.sorted.uniq-40tou.recode.vcf --recode --recode-INFO-all --not-chr 0 --out project_96s-geno01-maf05-snps.sorted.uniq-40tou-noinclude0 &
nohup /home/software/vcftools/src/cpp/vcftools --vcf project_96s-geno01-maf05-snps.sorted.uniq-40tou-noinclude0.recode.vcf --recode --recode-INFO-all --not-chr 27 --out project_96s-geno01-maf05-snps.sorted.uniq-40tou-noinclude0-27 &
nohup /home/software/vcftools/src/cpp/vcftools --vcf project_96s-geno01-maf05-snps.sorted.uniq-40tou-noinclude0-27.recode.vcf --recode --recode-INFO-all --not-chr 30 --out project_96s-geno01-maf05-snps.sorted.uniq-40tou-noinclude0-27-30 &
7.轉(zhuǎn)成map ped,調(diào)整ped文件前兩列 章贞,第一列為品種名稱祥绞,第二列為樣品名稱,做出bed bam fam文件
nohup /home/software/vcftools/src/cpp/vcftools --vcf project_96s-geno01-maf05-snps.sorted.uniq-40tou-noinclude0-27-30.recode.vcf --plink --out project_96s-geno01-maf05-snps.sorted.uniq-40tou-noinclude0-27-30.recode &(vcf---map ped)
cat project_96s-geno01-maf05-snps.sorted.uniq-40tou-noinclude0-27-30.recode.ped | awk -F " " '{$1=null;print}' > project_96s-geno01-maf05-snps.sorted.uniq-40tou-noinclude0-27-30-1.recode.ped
cat project_96s-geno01-maf05-snps.sorted.uniq-40tou-noinclude0-27-30-1.recode.ped | awk -F " " '{$1=null;print}' > project_96s-geno01-maf05-snps.sorted.uniq-40tou-noinclude0-27-30-12.recode.ped
dos2unix 96.txt
paste -d "" 96.txt project_96s-geno01-maf05-snps.sorted.uniq-40tou-noinclude0-27-30-12.recode.ped > project_96s-geno01-maf05-snps.sorted.uniq-40tou-noinclude0-27-30.recode.ped
sed -i 's/ /\t/g' project_96s-geno01-maf05-snps.sorted.uniq-40tou-noinclude0-27-30.recode.ped
nohup /home/software/plink --allow-extra-chr --chr-set 26 --file project_96s-geno01-maf05-snps.sorted.uniq-40tou-noinclude0-27-30.recode --make-bed --out project_96s-geno01-maf05-snps.sorted.uniq-40tou-noinclude0-27-30.recode &(map ped ---bed bim fam)
8.計(jì)算ROH
nohup /home/software/plink --allow-extra-chr --chr-set 26 --bfile project_96s-geno01-maf05-snps.sorted.uniq-40tou-noinclude0-27-30.recode --homozyg-window-snp 50 --homozyg-snp 50 --homozyg-kb 300 --homozyg-density 50 --homozyg-gap 1000 --homozyg-window-missing 5 --homozyg-window-threshold 0.05 --homozyg-window-het 03 --out project_96s-geno01-maf05-snps.sorted.uniq-40tou-noinclude0-27-30.recode &
#.indiv文件中NSEG為個(gè)數(shù),KB為長度蜕径,KBAVG為平均長度两踏。
將 ROHs 分為了三類,(1)短片段 ROHs(class A)兜喻;(2)中等長度 ROHs(class B)梦染;(3)長片段 ROHs(class C)。
對照的界值為 class A: 500Kb-700Kb, class B: 700Kb-1500Kb, class C:超過 1500Kb 朴皆。
http://events.jianshu.io/p/5f5d78def191
PLINK 中使用滑動(dòng)窗口的方法來進(jìn)行 ROHs 的檢測:特定 SNP 數(shù)目的窗口以一定的步長在 SNP 芯片上滑動(dòng)弓坞,對能夠滿足一定參數(shù)(默認(rèn):缺失率<5,雜合率<1)的窗口進(jìn)行拼接從而完成對 ROHs 的檢測车荔。
這個(gè)過程在 PLINK v1.07中通過命令“homozyg“來完成渡冻,考慮到基因組上存在一定數(shù)量的拷貝數(shù)改變,這種拷貝數(shù)改變(CNV)往往較短忧便,通過 PLINK 進(jìn)行 ROHs 的檢測時(shí)無法將這類拷貝數(shù)改變區(qū)分開來族吻,參考以前的研究,長度在 0.5Mb 以上的拷貝數(shù)改變在健康人群中非常少見珠增,因此將 ROHs 的長度限制設(shè)置為 0.5Mb超歌。另外,如果一個(gè)ROH 包含兩個(gè)距離超過 0.1Mb 的 SNPs蒂教,那么這個(gè) ROH 被打斷為兩個(gè) ROHs巍举。
每個(gè) ROH 包含的 SNPs 數(shù)量閾值尚無統(tǒng)一定論,因此分別評估 SNPs 閾值為50,60,70,80,90,100 個(gè) SNPs 的 ROHs 的情況凝垛。
ROH分析(重測序數(shù)據(jù)分染色體做)
# 求和:
cat D1.hwe | awk '{a+=$7}END{print a}'
# 計(jì)算ROH:
do /home/software/plink --file /home/zhangsj/Wild_European/Plink/Wild_European_chr${K}? --homozyg --out Wild_European_chr${K}
# 得到三個(gè)文件:.hom /.hom.summary / .hom.indiv
#使用indiv文件進(jìn)行統(tǒng)計(jì)(利用上面的求和腳本)懊悯,計(jì)算每個(gè)個(gè)體的平均ROH的長度,和數(shù)量梦皮。
#使用excel進(jìn)行統(tǒng)計(jì)
=SUM(OFFSET(D$1,(ROW(A1)-1)*19,,19)) }
ROH畫圖R腳本
注意:R中使用的map ped 文件均為空格格式L糠帧!剑肯!
install.packages('detectRUNS')
library('detectRUNS')
setwd("D:/putty/Rdata/ROH/detectRUNS_0.9.6/dec")
# getting map and ped paths
genotypeFile <- system.file("dec", "2.recode.ped", package = "detectRUNS")
mapFile <- system.file("dec", "2.recode.map", package = "detectRUNS")
# sliding-window-based run detectionslid
slidingRuns <- slidingRUNS.run( genotypeFile = "2.recode.ped",mapFile = "2.recode.map", windowSize = 15, threshold = 0.05, minSNP = 20, ROHet = FALSE, maxOppWindow = 1, maxMissWindow = 1, maxGap = 10^6, minLengthBps = 250000, minDensity = 1/10^3, maxOppRun = NULL, maxMissRun = NULL )
# consecutive SNP-based run detection
consecutiveRuns <- consecutiveRUNS.run(genotypeFile = "Kijas2016_Sheep_subset.ped",mapFile = "Kijas2016_Sheep_subset.map", minSNP = 20, ROHet = FALSE, maxGap = 10^6, minLengthBps = 250000, maxOppRun = 1, maxMissRun = 1 )
# “Runs of heterozygosity” 計(jì)算雜合
slidingRuns_het <- slidingRUNS.run( genotypeFile = "Kijas2016_Sheep_subset.ped",mapFile = "Kijas2016_Sheep_subset.map", windowSize = 10, threshold = 0.05, minSNP = 10, ROHet = TRUE, maxOppWindow = 2, maxMissWindow = 1, maxGap = 10^6, minLengthBps = 10000, minDensity = 1/10^6, maxOppRun = NULL, maxMissRun = NULL )
consecutiveRuns_het <- consecutiveRUNS.run( genotypeFile = "Kijas2016_Sheep_subset.ped",mapFile = "Kijas2016_Sheep_subset.map", minSNP = 10, ROHet = TRUE, maxGap = 10^6, minLengthBps = 10000, maxOppRun = 2, maxMissRun = 1 )
# Summary statistics on detected runs
summaryList <- summaryRuns( runs = slidingRuns, mapFile = "2.recode.map", genotypeFile = "2.recode.ped", Class = 6, snpInRuns = TRUE)
summaryList$summary_ROH_count # 可以看到雅各布斯綿羊中0-6mbp有894個(gè)ROH捧毛。
summaryList$summary_ROH_mean_chr # 得到每個(gè)染色體和每個(gè)品種的平均ROH數(shù)。
head(summaryList$SNPinRun) # 對于每個(gè)SNP让网,數(shù)據(jù)框“SNPinRun”包含了它在任何給定人群/組中運(yùn)行的次數(shù)比例呀忧。
# The summary information included in the list returned from summaryRuns() is conveniently organized in data.frames, so that it can easily be visualized, manipulated and written out to text files.
# 包含在從summaryRuns()返回的列表中的摘要信息可以方便地組織在data.frames中,這樣就可以方便地可視化溃睹、操作并將其寫入文本文件而账。
# Plots
plot_Runs(runs = slidingRuns[slidingRuns$chrom==2,])
plot_StackedRuns(runs = slidingRuns)
plot_StackedRuns(runs = slidingRuns[slidingRuns$group=="XLL",])
plot_SnpsInRuns(runs = slidingRuns[slidingRuns$chrom==2,], genotypeFile = "2.recode.ped", mapFile = "2.recode.map")
plot_SnpsInRuns( runs = slidingRuns[slidingRuns$chrom==24,], genotypeFile = "Kijas2016_Sheep_subset.ped", mapFile = "Kijas2016_Sheep_subset.map")
topRuns <- tableRuns( runs = slidingRuns, genotypeFile = "2.recode.ped", mapFile ="2.recode.map", threshold = 0.7)
plot_manhattanRuns( runs = slidingRuns[slidingRuns$group=="XLL",], genotypeFile ="2.recode.ped", mapFile = "2.recode.map")
# From runs of homozygosity (ROH)
# getting map and ped paths
genotypeFile <- system.file("dec", "2.recode.ped", package = "detectRUNS")
mapFile <- system.file("dec", "2.recode.map", package = "detectRUNS")
# calculating runs of Homozygosity
## Not run:
# skipping runs calculation
runs <- slidingRUNS.run( genotypeFile= "2.recode.ped", mapFile ="2.recode.map", windowSize = 15, threshold = 0.1, minSNP = 15,ROHet = FALSE, maxOppositeGenotype = 1, maxMiss = 1, minLengthBps = 100000, minDensity = 1/10000)
## End(Not run)
# loading pre-calculated data
runsFile <- system.file("dec", "2.recode.csv", package="detectRUNS")
runsDF <- readExternalRuns(inputFile = runsFile, program = 'detectRUNS')
Froh_inbreeding(runs = runsDF, mapFile = "2.recode.map")
Froh_inbreeding(runs = runsDF, mapFile = "2.recode.map", genome_wide=FALSE)
slidingRuns <- slidingRUNS.run( genotypeFile = "2.recode.ped",mapFile = "2.recode.map", windowSize = 15, threshold = 0.05, minSNP = 20, ROHet = FALSE, maxOppWindow = 1, maxMissWindow = 1, maxGap = 10^6, minLengthBps = 250000, minDensity = 1/10^3, maxOppRun = NULL, maxMissRun = NULL )
ROH畫圖孫腳本
制作391.recode.hom.indiv.txt
Kazakh SRR 246 140312
Kazakh SRR 235 123750
Mongolian SRR 242 124087
Mongolian SRR 212 106493
Mongolian SRR 214 257443
Mongolian SRR 178 81486.2
Jingjiang SRR 647 352246
Jingjiang SRR 612 345693
b=read.table("391.recode.hom.indiv.txt",header=F)
# txt為indiv數(shù)據(jù)
head(b)
library("ggplot2")
qplot(b[,3],b[,4],col=factor(b[,1]),xlab="11",ylab="22")
Breed=b[,1]
png("roh.png",width = 2200,height = 1400,units="px",res=391) # res為分辨率
ggplot(data = b , aes(x = b[,3],y = b[,4],group = Breed,shape = Breed,color = Breed))+labs(x="Number",y="Length")+geom_point(size=1.2)+scale_shape_manual(values = seq(0,75))
dev.off()
x軸為數(shù)量
y軸為長度
plink計(jì)算近交系數(shù) inbreeding coefficient
近交系數(shù)是指根據(jù)近親交配的世代數(shù),將基因的純化程度用百分?jǐn)?shù)來表示即為近交系數(shù)丸凭,也指個(gè)體由于近交而造成異質(zhì)基因減少時(shí)福扬,同質(zhì)基因或純合子所占的百分比也叫近交系數(shù)腕铸,個(gè)體中兩個(gè)親本的共祖系數(shù)。
從基因?qū)用鎭碚f铛碑,近親婚配的后果就是一個(gè)基因的allele來自共同祖先狠裹,即血緣同源IBD。 近交系數(shù)的定義為在一個(gè)常染色體基因座上獲得 2 個(gè)同源相同(identity by descent, IBD)等位基因的概率汽烦。為了更客觀的描述個(gè)體間近親婚配情況涛菠,提出以下兩個(gè)概念:
1.coffcient of relationship,針對兩個(gè)個(gè)體間,表示的是兩個(gè)個(gè)體間來自共同祖先的同源基因比例撇吞,稱之為共祖系數(shù)俗冻。
2.cofficient of inbreeding,針對一個(gè)個(gè)體,表示的是該個(gè)體任意一個(gè)基因的兩個(gè)allele來自同一個(gè)祖先的概率牍颈,稱之為近交系數(shù)迄薄。
1.基于基因組關(guān)系矩陣的近交系數(shù)(FGRM)
2.基因組近交系數(shù)
(1)基于ROH計(jì)算的基因組近交系數(shù) (FROH) -homozyg
FROH:Froh refers to the proportion of the genome (0-1) that is in runs of homozygosity (ROHs).
# 計(jì)算LROH(分染色體做,后合在一起)
vi ROH.sh
for K in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 ;do /home/software/vcftools/src/cpp/vcftools --vcf test.vcf --LROH --chr ${K} --out test.vcf.${K}ROH ;done
nohup bash ROH.sh &
cat *.LROH > all.LROH # 將所有的.ROH制作成all.ROH,打開去除表頭
# 芯片v3.0中綿羊基因組全長2653.84Mb
# 胖笏辏基因組全長2489.37Mb
# 重測序綿羊基因組全長
R統(tǒng)計(jì)LROH
a=read.table("all.LROH",header=FALSE)
cha=a[,3]-a[,2]
jishu=aggregate(cha,by=list(a[,8]),length)
zc=aggregate(cha,by=list(a[,8]),sum)
aa=data.frame(jishu,zc)
write.table(aa,"aa.txt",col.names=F,row.names=F,quote=F,sep="\t")
(2)基于基因組雜合度的近交系數(shù)(Fhom) -het (使用plink軟件根據(jù)snp同源性計(jì)算近交系數(shù))
Fh is the canonical estimate of genomic F based on excess SNP homozygosity.
[Keller Matthew C,Visscher Peter M,Goddard Michael E. Quantification of inbreeding due to distant ancestors and its detection using dense single nucleotide polymorphism data.[J]. Genetics,2011,189(1).]
nohup /home/software/plink --noweb --file project_96s-geno01-maf05-snps.sorted.uniq-40tou-noinclude0-27-30.recode --het --out project_96s-geno01-maf05-snps.sorted.uniq-40tou-noinclude0-27-30.recode &
#該分析將自動(dòng)跳過單倍體標(biāo)記(男性X和Y染色體標(biāo)記)
#.het文件中 FID:家系ID讥蔽,IID:個(gè)體ID,O(HOM):觀察到的純合子數(shù)量画机,E(HOM):期望的純合子數(shù)量冶伞,N(NM):非缺失基因型數(shù)量,F(xiàn):近交系數(shù)估計(jì)(本例:F=0.036538)
值得注意的是步氏,從概念的定義可以看出响禽,F(xiàn)值理論上是位于0到1范圍內(nèi)的正數(shù),而軟件的計(jì)算結(jié)果中會(huì)出現(xiàn)負(fù)數(shù),這通常是計(jì)算過程中隨機(jī)抽樣的誤差荚醒,說明該計(jì)算結(jié)果不是很可靠芋类。但是如果負(fù)值非常大,比如-0.5以上腌且,這說明這個(gè)樣本雜合子較多梗肝,可能存在了DNA的污染,其分型結(jié)果是有問題的铺董。
例如:大部分個(gè)體沒有近交情況,絕大多數(shù)個(gè)體近交系數(shù)在0~0.15禀晓,少部分個(gè)體近交程度比較高精续,近交系數(shù)最高為0.25。[郝鑫宇,馮健,孫延曉,陳琳琳,王起山,潘玉春.保種豬場系譜記錄粹懒、管理及群體遺傳結(jié)構(gòu)分析——以萊蕪豬為例[J].中國畜牧雜志,2020,56(04):60-65.]
3.系譜近交系數(shù)
系譜近交系數(shù) (FPED)
plink計(jì)算親緣系數(shù)
http://www.reibang.com/p/8340055d3646?utm_campaign=haruki
親緣系數(shù)又稱血緣系數(shù)重付,親屬間遺傳相關(guān)系數(shù),親緣相關(guān)系數(shù)凫乖。是指將群體中個(gè)體之間基因組成的相似程度用數(shù)值來表示即為血緣系數(shù)确垫。意義即擁有共同祖先的兩個(gè)人弓颈,在某一位點(diǎn)上具有相同基因的概率。它與近交系數(shù)不同 , 近交系數(shù) Fχ是說明 x 這個(gè)個(gè)體本身是什么程度的近交產(chǎn)生的 , 而親緣系數(shù)則是反映兩個(gè)個(gè)體間的遺傳相關(guān)(親緣)程度删掀。
#該步驟進(jìn)行樣本間的親緣關(guān)系估計(jì)翔冀,并將清除未標(biāo)記明顯親子關(guān)系的個(gè)體親緣關(guān)系過近的個(gè)體。
plink --noweb --file 3 --genome --out 4
# 3表示質(zhì)控后的PED和MAP文件
# 輸出:4.genome
FID1 First sample’s family ID
IID1 First sample’s within-family ID
FID2 Second sample’s family ID
IID2 Second sample’s within-family ID
RT Relationship type inferred from .fam/.ped file披泪,從fam/ped文件推斷的樣本關(guān)系 UN
EZ IBD sharing expected value, based on just .fam/.ped relationship NA
Z0 P(IBD=0)
Z1 P(IBD=1)
Z2 P(IBD=2)
PI_HAT P(IBD=2)+0.5*P(IBD=1) ( proportion IBD ) ////// Proportion IBD, i.e. P(IBD=2) + 0.5*P(IBD=1)纤子,IBD比例
PHE Pairwise phenotypic code (1,0,-1 = AA, AU and UU pairs) ////// Pairwise phenotypic code (1, 0, -1 = case-case, case-ctrl, and ctrl-ctrl pairs, respectively)
DST IBS distance (IBS2 + 0.5*IBS1) / ( N SNP pairs ) ////// IBS distance, i.e. (IBS2 + 0.5*IBS1) / (IBS0 + IBS1 + IBS2)
PPC IBS binomial test
RATIO Of HETHET : IBS 0 SNPs (expected value is 2)
例如:利用基因組同源相同信息估計(jì)個(gè)體間親緣關(guān)系,在全基因組關(guān)聯(lián)分析等研究中常被用于無關(guān)個(gè)體的挑選款票,部分研究使用的標(biāo)準(zhǔn)為個(gè)體間pi_ hat<0. 125控硼。傳統(tǒng)上,利用系譜信息估計(jì)親緣系數(shù)為0.125時(shí)艾少,反映的是至少3代以上個(gè)體間的親緣關(guān)系卡乾。與此同時(shí),有研究認(rèn)為親緣關(guān)系在3代以外的個(gè)體可以近似為無關(guān)個(gè)體缚够。本研究表明说订,約50%的個(gè)體對間親緣系數(shù)大于等于0.125,即個(gè)體間存在較近的親緣關(guān)系潮瓶。同樣陶冷,利用基因組信息通過計(jì)算基因組純合片段占比發(fā)現(xiàn),約16.67%的樣本近交系數(shù)大于等于0.125毯辅。上述分析表明埂伦, 恩施黑豬個(gè)體間親緣關(guān)系較近,具有明顯的近交累積現(xiàn)象。
[吳林慧,孫琦,王荔茹,張凱麗,謝勝松,李新云,趙書紅,馬云龍.恩施黑豬基因組群體遺傳學(xué)參數(shù)的估計(jì)與選擇信號(hào)研究[J].畜牧獸醫(yī)學(xué)報(bào),2019,50(03):485-494.]
例如:絕大部分個(gè)體間的親緣系數(shù)在0~0.35思恐,親緣系數(shù)最高的為0.579沾谜。[郝鑫宇,馮健,孫延曉,陳琳琳,王起山,潘玉春.保種豬場系譜記錄、管理及群體遺傳結(jié)構(gòu)分析——以萊蕪豬為例[J].中國畜牧雜志,2020,56(04):60-65.]
R計(jì)算近交系數(shù)與親緣系數(shù)
https://blog.csdn.net/yijiaobani/article/details/87876901
https://cloud.tencent.com/developer/article/1550322
1.近交系數(shù)
ID <- c("X","B","A","C","D","E","H","G","I")
Sire <- c("B","E","C","E","E","H",0,0,0)
Dam <- c("A",0,"D",0,"G","G","I","I",0)
ped <- data.frame(ID,Sire,Dam)
ped
install.packages('nadiv')
library(nadiv)
pped =prepPed(ped)
A = as.matrix(makeA(pped));A
re = data.frame(ID=row.names(A),inbreeding = diag(A)-1)
write.csv(re,"inbreeding.csv",row.names=F)
2.親緣系數(shù)
id <- c(289,135,181,90,188,49)
sire <- c(135,90,49,16,0,16)
dam <- c(181,188,188,0,0,0)
ped <- data.frame(id,sire,dam)
ped
library(nadiv)
pped = prepPed(ped)
mat = as.matrix(makeA(pped))
# 135 VS 181
mat[5,6]/sqrt(mat[5,5]*mat[6,6])
# 289 VS 16
mat[1,7]/sqrt(mat[1,1]*mat[7,7])
id = row.names(mat)
id1 = rep(id,length(id))
id2 = rep(id,each=length(id))
coeff = matrix(0,dim(mat)[1],dim(mat)[1])
for(i in 1:dim(mat)[1]){
for(j in 1:dim(mat)[1]){
coeff[i,j] = mat[i,j]/sqrt(mat[i,i]*mat[j,j])
}
}
coeff_value = as.vector(coeff)
re = data.frame(id1,id2,coeff_value)
write.csv(re,"coeff.csv",row.names=F)
IBD與IBS
IBD全稱Identity By Descent, 又叫做血緣同源胀莹,指的是兩個(gè)個(gè)體中共有的等位基因來源于共同祖先基跑。
IBS全稱Identity By State, 又叫做狀態(tài)同源,指的是兩個(gè)個(gè)體中共有的等位基因序列相同描焰。IBS的遺傳距離可以衡量樣本 間的相似性媳否,評價(jià)其親緣關(guān)系。
對于某個(gè)等位基因荆秦,IBS state只要求allel的個(gè)數(shù)相同即可篱竭,而IBD state則進(jìn)一步要求相同的allel來自于共同祖先。
1.iBD
https://blog.csdn.net/weixin_43569478/article/details/108079637
iBD
# 利用IBD可以描述兩個(gè)樣本間的親緣關(guān)系步绸,采用plink計(jì)算(IBD)PI_HAT:
nohup /home/software/plink --allow-extra-chr --chr-set 29 --noweb --file test --genome --allow-no-sex --out test-ibd &
# 理想狀態(tài)下父子關(guān)系的兩個(gè)樣本掺逼,Z0, Z1, Z2對應(yīng)的值分別為0,1, 0,所有位點(diǎn)的一個(gè)allel都繼承自父本瓤介;同卵雙胞胎的兩個(gè)樣本吕喘,則為0,0,1赘那,所有的allel都來自共同的祖先,對于異卵雙胞胎氯质,則為0.25,0.5,0.25
# PI_HAT這個(gè)統(tǒng)計(jì)量的取值范圍為0-1募舟,數(shù)值越大,兩個(gè)樣本的親緣關(guān)系越近病梢,當(dāng)為1時(shí)胃珍,表示的就是同卵雙胞胎,或者重復(fù)樣本蜓陌,可以根據(jù)這個(gè)值篩選親緣關(guān)系近的樣本進(jìn)行過濾觅彰。
2.iBS
https://blog.csdn.net/weixin_43569478/article/details/108079152
http://www.reibang.com/p/d555dc28c3c0?from=singlemessage
iBS
# 在家系數(shù)據(jù)中,由于有父代的分型數(shù)據(jù)钮热, IBD運(yùn)用的很多填抬,在自然群體中,則通常使用IBS隧期。
# 基于SNP分型結(jié)果飒责, 我們可以計(jì)算樣本間的IBS距離。
# 距離可以衡量樣本間的相似性仆潮,根據(jù)IBS distance距離矩陣宏蛉,可以對樣本進(jìn)行MDS分析。
# 通過MDS分析對樣本構(gòu)成可以有一個(gè)清楚的認(rèn)識(shí)性置,也可以用于剔除明顯偏離群體的樣本拾并。在實(shí)際的數(shù)據(jù)分析中,可以借助plink軟件來計(jì)算IBS距離矩陣鹏浅,用法如下:
nohup /home/software/plink --allow-extra-chr --chr-set 29 --file QC.project_96s_noinclude0.recode-502502.bed-geno02-maf03 --cluster --matrix &
# 默認(rèn)情況下會(huì)生成plink.mibs文件嗅义,是一個(gè)距離矩陣,可以用R語言讀取隐砸,然后進(jìn)行MDS分析之碗。R語言中MDS分析的代碼如下R畫圖
m <- as.matrix(read.table("plink.mibs"))
mds <- cmdscale(as.dist(1-m))
k <- c( rep("green",45) , rep("blue",44) )
plot(mds,pch=20,col=k)
legend("topleft", legend = c("CHB", "JPB"), pch = 20, col = c("green", "blue"))
計(jì)算SNP位點(diǎn)雜合度:期望雜合度 觀測雜合度(分群體計(jì)算后,所有的加在一起除以位點(diǎn)數(shù))
nohup /home/software/plink --allow-extra-chr --chr-set 29 --hardy --file com-517-51 --out com-517-51 &
分別計(jì)算
O(HET)
cat test.hwe |awk '{sum+=$7} END {print "Average = ", sum/NR}'
E(HET)
cat test.hwe |awk '{sum+=$8} END {print "Average = ", sum/NR}'
查看ped文件有多少列
cat filename | awk '{print NF}'
cat filename | awk -F '{print NF}' (非tab鍵文件使用-F)