ABBA-BABA分析方法最早發(fā)布于2012年NATURE的文章:Butterfly genome reveals promiscuous exchange of mimicry adaptations among species(https://www.nature.com/articles/nature11041)
原理:
簡單來說荐吵,假設(shè)有三個物種看铆,P1,P2,P3,利用它們的全基因組SNPs建樹得到的是下圖中的樹,即P1和P2為單系袒哥,再與P3聚到一起。那么根據(jù)分子鐘原理盆赤,P1與P2的共同祖先從P3中分離出阻塑,而后經(jīng)過相同時間演化出P1和P2,那么以下幾點(diǎn)應(yīng)該成立:
(1)【P1到P3】和【P2到P3】的遺傳距離應(yīng)該是大致相同的(考慮到不同位點(diǎn)的進(jìn)化速率不同诚撵,至少在全基因組范圍缭裆,其平均遺傳距離應(yīng)該是大抵相同的)。
(2)假如P2與P3之間發(fā)生過雜交事件寿烟,那么P2-P3的距離應(yīng)該小于P1-P3的舉例澈驼;反之,如果P1與P3發(fā)生過雜交事件筛武,則P1-P3<P2-P3缝其。
(3)ABBA-BABA顧名思義,將遺傳距離相近的兩物種用同一字母表示(A或B)畅铭,ABBA即為P2-P3之間距離較近(但此時不能下定論氏淑,要通過統(tǒng)計(jì)量確定其兩者是否存在遺傳漸滲),BABA即為P1與P3之間距離較近硕噩。
(4)D統(tǒng)計(jì)(D-statistics)以及Z-score的計(jì)算:假設(shè)使用全基因組SNPs數(shù)據(jù)進(jìn)行研究假残,無論是使用滑窗法(將連續(xù)的數(shù)個SNPs一起分析)還是單個SNPs進(jìn)行研究,最終的D統(tǒng)計(jì)都是:D=(ABBA樹數(shù)量-BABA樹數(shù)量)/(ABBA樹數(shù)量+BABA樹數(shù)量)炉擅』岳粒看到這里也就懂了,如果D統(tǒng)計(jì)量=0谍失,說明總體ABBA和BABA的數(shù)量相同眶俩,不存在明顯的滲入;如果不等于0快鱼,則或許存在滲入颠印。進(jìn)一步計(jì)算Z-score,在本軟件中用的是:
??-score = ??/??????_??????(??)
Z>3和<-3分別是正負(fù)顯著抹竹。
P.S. 其實(shí)Patterson‘s D統(tǒng)計(jì)就是ABBA-BABA統(tǒng)計(jì)线罕。
實(shí)際軟件操作:
Dsuite軟件鏈接:https://github.com/millanek/Dsuite
Publication:Malinsky, M., Matschiner, M. and Svardal, H. (2020), Dsuite ‐ fast D‐statistics and related admixture evidence from VCF files. Mol Ecol Resour. Accepted Author Manuscript. https://doi.org/10.1111/1755-0998.13265
Dsuite的優(yōu)勢:
(1)可以使用SNPs和Indels
(2)個體及種群無上限(可到幾百)
(3)可以設(shè)置genomic sliding windows
(4)可以進(jìn)行全基因組范圍的D統(tǒng)計(jì)量和f4-ratio計(jì)算(f4是一種統(tǒng)計(jì)量,用來驗(yàn)證一棵無根樹是否符合預(yù)想的拓?fù)淝耘校唧w解釋見后文钞楼,類似的還有f3,f2等等)
1袄琳、安裝
$ git clone https://github.com/millanek/Dsuite.git
$ cd Dsuite
$ make
2询件、命令
Commands:
Dtrios Calculate D (ABBA-BABA) and f4-ratio statistics for all possible trios of populations/species
DtriosCombine Combine results from Dtrios runs across genomic regions (e.g. per-chromosome)
Dinvestigate Follow up analyses for trios with significantly elevated D:
calculates f_d, f_dM, and d_f in windows along the genome
Fbranch Calculate D and f statistics for branches on a tree that relates the populations/species
Experimental:
Dquartets Calculate D (ABBA-BABA) and f4-ratio statistics for all possible quartets of populations/species
(no outgroup specified)
Usage:
a) Dsuite Dtrios [OPTIONS] INPUT_FILE.vcf SETS.txt
b) Dsuite Dquartets [OPTIONS] INPUT_FILE.vcf SETS.txt
c) Dsuite Dinvestigate [OPTIONS] INPUT_FILE.vcf.gz SETS.txt test_trios.txt
d) Dsuite Fbranch [OPTIONS] TREE_FILE.nwk FVALS_tree.txt
一般應(yīng)該是用Dtrios
3燃乍、輸入文件
(1)vcf文件,可以包含SNPs和Indels的多等位基因位點(diǎn)宛琅,但必須是biallelic的刻蟹。.vcf
(2)個體及種群名錄(Population/species map )SETS.txt
如下圖所示 注意要用tab分割
Ind1 Species1
Ind2 Species1
Ind3 Species2
Ind4 Species2
Ind5 Species3
Ind6 Outgroup
Ind7 Outgroup
Ind8 xxx
... ...
IndN Species_n
想用多少用多少個體,如果中途希望取子集研究嘿辟,只需要把不需要的個體用xxx替代就行座咆。
在Dtrios
命令中至少要指定一個outgroup。
在Dquartets
中不應(yīng)該有outgroup仓洼,因?yàn)檫@個命令是探索所有可能patterns的介陶。
要用另外三個算法的話要輸入額外的文件,這里不再列出色建,可以參考github(見上文鏈接)
4哺呜、運(yùn)行Dtrios
Usage: Dsuite Dtrios [OPTIONS] INPUT_FILE.vcf SETS.txt
Calculate the D (ABBA/BABA) and f4-ratio (f_G) statistics for all trios of species in the dataset (the outgroup being fixed)
The results are as definded in Patterson et al. 2012 (equivalent to Durand et al. 2011 when the Outgroup is fixed for the ancestral allele)
The SETS.txt should have two columns: SAMPLE_ID SPECIES_ID
The outgroup (can be multiple samples) should be specified by using the keywork Outgroup in place of the SPECIES_ID
-h, --help display this help and exit
-k, --JKnum (default=20) the number of Jackknife blocks to divide the dataset into; should be at least 20 for the whole dataset
-j, --JKwindow (default=NA) Jackknife block size in number of informative SNPs (as used in v0.2)
when specified, this is used in place of the --JKnum option
-r, --region=start,length (optional) only process a subset of the VCF file
-t, --tree=TREE_FILE.nwk (optional) a file with a tree in the newick format specifying the relationships between populations/species
D and f4-ratio values for trios arranged according to the tree will be output in a file with _tree.txt suffix
-o, --out-prefix=OUT_FILE_PREFIX (optional) the prefix for the files where the results should be written
output will be put in OUT_FILE_PREFIX_BBAA.txt, OUT_FILE_PREFIX_Dmin.txt, OUT_FILE_PREFIX_tree.txt etc.
by default, the prefix is taken from the name of the SETS.txt file
-n, --run-name (optional) run-name will be included in the output file name after the PREFIX
--no-f4-ratio (optional) don't calculate the f4-ratio
-l NUMLINES (optional) the number of lines in the VCF input - required if reading the VCF via a unix pipe
5、總結(jié)
總而言之箕戳,最后的代碼大概就這一行某残,參數(shù)可以都為默認(rèn)
Dsuite Dtrios [OPTIONS] INPUT_FILE.vcf SETS.txt
最后得到的BBAA.txt, Dmin.txt, tree.txt就是最終結(jié)果,之后再繪圖陵吸。
如果是將基因組分成若干文件來跑的玻墅,需要注意設(shè)置【-n, --run-name】參數(shù),最后可以用DtriosCombine將combine.txt和combine_stderr.txt合并壮虫。
f4統(tǒng)計(jì)量簡介
直接看下文公式:假設(shè)有A澳厢、B、C囚似、D四個個體剩拢,代表四個類群,aj/bj/cj/dj分別是該類群在位點(diǎn)j的參考基因組基因頻率(reference allele frequency)饶唤⌒旆ィ總位點(diǎn)J個。分母是用來將統(tǒng)計(jì)量正態(tài)化的募狂,類群x可以左右選擇办素,對好的是選擇與感興趣類群關(guān)系較近的population。
同樣的祸穷,f4統(tǒng)計(jì)量最后也要用計(jì)算Zscroe性穿,以正負(fù)3為顯著性的閾值。