單倍型颠焦,是單倍體基因型的簡(jiǎn)稱,在遺傳學(xué)上是指在同一染色體上進(jìn)行共同遺傳的多個(gè)基因座上等位基因的組合粥谬;通俗的說法就是若干個(gè)決定同一性狀的緊密連鎖的基因構(gòu)成的基因型。
單倍型分析舉例:
我們以目的基因X來進(jìn)行實(shí)戰(zhàn)分析
(1) 提取基因X的基因區(qū)SNP
bcftools filter All.vcf.gz --regions chr:start-end > GeneX.vcf
#chr:基因X所在的染色體辫塌;start:基因X起始位點(diǎn)漏策;end:基因X終止位點(diǎn)【拾保可以檢索基因組gff文件獲得
(2) 重新編碼vcf文件的基因型
我們這里主要針對(duì)雙等位基因vcf文件進(jìn)行分析掺喻,vcf中包含./.,0/0储矩,0/1感耙,1/1四種基因型。
編碼對(duì)應(yīng)表(轉(zhuǎn)換后 - 轉(zhuǎn)換前):
- -1 - ./.
- 0 - 0/0
- 1 - 0/1
- 2 - 1/1
使用vcftools可以實(shí)現(xiàn)基因型的轉(zhuǎn)換:
vcftools --vcf test.vcf --012
Java腳本也可以實(shí)現(xiàn)替換基因型
public class HaploType {
public static void main(String[] args) throws IOException {
this.changeVcfCode("input.txt","output,txt");
}
public static void changeVcfCode(String input, String output) throws IOException {
BufferedReader rd = new BufferedReader(new FileReader(new File(input)));
BufferedWriter wt = new BufferedWriter(new FileWriter(new File(output)));
String line;
List<String> listTab;
while ((line = rd.readLine()) != null ) {
if (line.startsWith("#")) continue;
listTab = Arrays.asList(line.split("\t"));
StringBuilder bf = new StringBuilder();
for (int i = 9; i < listTab.size(); i++) {
if(listTab.get(i).startsWith("1/1")){
bf.append("2\t");
}else if(listTab.get(i).startsWith("0/0")){
bf.append("0\t");
}else if(listTab.get(i).startsWith("0/1") | listTab.get(i).startsWith("1/0") ){
bf.append("1\t");
}else if(listTab.get(i).startsWith("./.")){
bf.append("-1\t");
}else{
bf.append(listTab.get(i)+"\t");
}
}
wt.write(bf.toString()+"\n");
}
rd.close();
wt.close();
}
}
生成haplo.txt文件持隧,沒有表頭即硼。
(3) 基因單倍型圖譜
- 輸入文件
(1)haplo.txt:行代表snp位點(diǎn),列代表樣本舆蝴。
(2)Annotation_sample.txt:可選的谦絮,用于注釋單倍型圖譜的樣本信息,行代表樣本洁仗,tab隔開层皱。
#Annotation_sample.txt
Growing_Habit Region
Winter SCA
Spring SCA
Winter EA
- 畫圖
#加載所需的軟件包
library(pheatmap)
require(reshape)
require (rworldmap)
require(rworldxtra)
#加載注釋內(nèi)容,根據(jù)自己的注釋文件進(jìn)行修改,145是樣本數(shù)目
annotation_col <- read.table("Annotation.txt",header=T,stringsAsFactors = T)
rownames(annotation_col) = c(1:145)
labels_col = rep(c(""), 145)
ann_colors = list(
Growing_Habit = c(Facultative = "yellow", Spring="red",Winter="blue"),
Region = c(AM = "#228B22",EA = "#D95F02",SCA="#1E90FF",WA="#E7298A", EU="#000000", AF="#FFD700",EAF="#8B008B")
)
#加載haplotype文件
GeneX <- read.table("haplo.txt", header=F, stringsAsFactors = F)
colnames(GeneX) <- c(1:145)
#畫圖
out <- pheatmap(GeneX, show_colnames=FALSE, legend_breaks = -1:2, legend_labels = c("./.", "0/0", "0/1", "1/1"), cutree_cols=3, cluster_row = FALSE, annotation_col = annotation_col, annotation_colors = ann_colors, main="")
out
(4) 基因單倍型地理分布圖譜
- 輸入文件
(1)haplo.txt:行代表樣本赠潦,列代表snp位點(diǎn)叫胖。
(2)Annotation_sample_location.txt:行代表樣本,tab隔開她奥。
#Annotation_sample_location.txt
Lat Lon
36.65 61.15
33.8 62.2
34.36 111.61
- 畫圖
#加載所需的軟件包
library(pheatmap)
require(reshape)
require (rworldmap)
require(rworldxtra)
#加載樣本經(jīng)緯度信息
latlon <- read.table("latlon.txt",header=T,stringsAsFactors = F, fill=TRUE)
rownames(latlon) <- c(1:145)
#對(duì)樣本經(jīng)緯度進(jìn)行聚類
library(cluster)
library(factoextra)
data2 <- latlon[!is.na(latlon$Lat),]
df = scale(data2)
#先求樣本之間兩兩相似性
result <- dist(df, method = "euclidean")
#產(chǎn)生層次結(jié)構(gòu)
result_hc <- hclust(d = result, method = "ward.D2")
#確定樣本地理位置聚類的數(shù)目瓮增,這里分成了40個(gè)地理group
data2$type <- cutree(result_hc, k=40)
lat_mean <- tapply(data2[,1],data2$type,mean,na.rm = TRUE)
lon_mean <- tapply(data2[,2],data2$type,mean,na.rm = TRUE)
data2$cluster1 <- NA
data2$cluster2 <- NA
#145是樣本數(shù),40是聚類數(shù)目
for(i in 1:145) {
for(j in 1:40){
if(data2[i,3] == j ){
data2[i,4] <- as.numeric(lat_mean[j])
data2[i,5] <- as.numeric(lon_mean[j])
}
}
}
new_latlon <- data2[,c(4,5)]
rownames(new_latlon) <- c(1:145)
#根據(jù)上述單倍型圖譜確定每個(gè)地理group的樣本以及標(biāo)注樣本對(duì)應(yīng)的單倍型(1哩俭,2绷跑,3...)
names <- as.numeric(colnames(GeneX[,out$tree_col[["order"]]]))
dataset <- new_latlon[names,]
new_latlon$id <- rownames(new_latlon)
dataset$sub <- NA
colnames(dataset)[1:3] <- c("Lat", "Lon", "id",)
#下面幾行代碼是確定不同的單倍型cluster的分界線。上面單倍型圖譜中分了三種不同的單倍型凡资,從圖中確定界限的樣本編號(hào)(41, 63, 123)砸捏,根據(jù)下面代碼的返回值為樣本確定單倍型group(1,2,3)
which(dataset$id==41,)
which(dataset$id==63,)
which(dataset$id==123,)
#確定樣本的單倍型類型
dataset$sub[1] <- 1
dataset$sub[2:11] <- 2
dataset$sub[12:145] <- 3
colnames(dataset)[4] <- "Sub.population.3"
#畫圖
dataset$value <- 1
reshape <- cast(dataset,Lat+Lon~Sub.population.3)
reshape2 <- as.data.frame(reshape)
mapPies(reshape2,xlim=c(-120,140),ylim=c(35,40),nameX="Lon",nameY="Lat",nameZs=c('1','2','3'),symbolSize=1, zColours=c('green','blue','orange'),barOrient='vert',oceanCol="#D1EEEE",landCol="#FFDAB9",main="29")
參考文獻(xiàn):
【1】https://www.nature.com/articles/s41588-020-00722-w
【2】https://www.nature.com/articles/s41467-020-18738-5