比較基因組學(xué)分析目錄
1:單拷貝基因構(gòu)建物種樹以及計(jì)算分化時(shí)間
2:基因家族收縮與擴(kuò)張分析
3:特異節(jié)點(diǎn)富集分析
1.前言
上次講到基因家族的收縮擴(kuò)張分析,這次我們以某個(gè)節(jié)點(diǎn)為切入點(diǎn),進(jìn)行特異節(jié)點(diǎn)的功能富集分析
其實(shí)這篇教程適用于所有的非模式物種(沒有現(xiàn)成的GO/KEGG庫的物種)的富集分析
物種特異的比較簡(jiǎn)單就不進(jìn)行實(shí)踐了
2.節(jié)點(diǎn)選擇
這次選擇一個(gè)進(jìn)化上比較重要的節(jié)點(diǎn),水生到陸生的過程,對(duì)應(yīng)進(jìn)化樹為node25
3.獲取目的基因
主要是目標(biāo)基因與背景基因的選擇犁珠,目標(biāo)基因可以是該節(jié)點(diǎn)顯著擴(kuò)張的基因家族包含的所有物種基因,背景基因一般為該節(jié)點(diǎn)包含的Orthogroup
3.1 獲取node25顯著擴(kuò)張的基因
cat Gamma_p0.05change.tab | cut -f1,27 | grep "+[1-9]" | cut -f1 > node25significant.expand
#node25顯著擴(kuò)張的orthogroupsID
grep -f node25significant.expand Orthogroups.txt |sed "s/ /\n/g" | grep -E "Ath|Csa|Vvi|Aof|Osa|Zma|Nco|Atr|Tpl|Cri|Smo|Mpo|Ppa" | sort | uniq > node25significant.expand.genes
目前我們得到了需要進(jìn)行富集分析的目標(biāo)基因,接下來是背景基因的選擇
3.2 node25背景基因選擇
awk 'NR!=1 && $27>0 {print $0}' Gamma_count.tab | cut -f1 > node25.orthogroups
#node25中存在基因的orthogroupsID
grep -f node25.orthogroups Orthogroups.txt |sed "s/ /\n/g" | grep -E "Ath|Csa|Vvi|Aof|Osa|Zma|Nco|Atr|Tpl|Cri|Smo|Mpo|Ppa" | sort | uniq > node25.genes
node25.genes和node25significant.expand.genes就是我們進(jìn)行富集分析所需要的基因號(hào)底挫,接下來就是構(gòu)建背景基因的GO和KEGG注釋,由于是無參物種脸侥,所以我們需要自己構(gòu)建
4 背景基因的GO和KEGG注釋
4.1 GO/KEGG注釋
使用eggnog對(duì)背景基因進(jìn)行注釋
seqkit grep -f node25.genes ../../../allpep.fa > node25background.fa
#提取背景基因凄敢,需要嚴(yán)格注意基因數(shù)是否一致
emapper.py -m diamond -i node25background.fa -o node25 --cpu 16
最后我們拿到node25.emapper.annotations文件,將前三行和#刪除
4.2 GO注釋庫構(gòu)建
獲取GO的注釋信息:下載地址:http://purl.obolibrary.org/obo/go/go-basic.obo 之后進(jìn)行修飾
grep "^id:" go-basic.obo |awk '{print $2}' > GO.id
grep "^name:" go-basic.obo |awk '{print $2}' > GO.name
grep "^namespace:" go-basic.obo |awk '{print $2}' > GO.class
paste GO.id GO.name GO.class -d "\t" > GO.library
文件一共有三列
GO:1903040 exon-exon junction complex assembly biological process
GO:1903045 neural crest cell migration involved in sympathetic nervous system development biological process
GO:1903046 meiotic cell cycle process biological process
GO:1903043 positive regulation of chondrocyte hypertrophy biological process
GO:1903044 protein localization to membrane raft biological process
GO:1903049 negative regulation of acetylcholine-gated cation channel activity biological process
GO:1903047 mitotic cell cycle process biological process
GO:1903048 regulation of acetylcholine-gated cation channel activity biological process
GO:1903012 positive regulation of bone development biological process
GO:1903013 response to differentiation-inducing factor 1 biological process
GO:1903010 regulation of bone development biological process
GO:1903011 negative regulation of bone development biological process
GO:1903016 negative regulation of exo-alpha-sialidase activity biological process
GO:1903017 positive regulation of exo-alpha-sialidase activity biological process
GO:1903014 cellular response to differentiation-inducing factor 1 biological process
GO:1903015 regulation of exo-alpha-sialidase activity biological process
GO:1903018 regulation of glycoprotein metabolic process biological process
GO:1903019 negative regulation of glycoprotein metabolic process biological process
GO:1903020 positive regulation of glycoprotein metabolic process biological process
GO:1903023 regulation of ascospore-type prospore membrane assembly biological process
4.3 KEGG注釋庫構(gòu)建
需要下載ko00001.json: https://www.genome.jp/kegg-bin/get_htext?ko00001
# 需要下載 json文件(這是是經(jīng)常更新的)
# https://www.genome.jp/kegg-bin/get_htext?ko00001
# 代碼來自:http://www.genek.tv/course/225/task/4861/show
library(jsonlite)
library(purrr)
library(RCurl)
library(dplyr)
library(stringr)
kegg <- function(json = "ko00001.json") {
pathway2name <- tibble(Pathway = character(), Name = character())
ko2pathway <- tibble(Ko = character(), Pathway = character())
kegg <- fromJSON(json)
for (a in seq_along(kegg[["children"]][["children"]])) {
A <- kegg[["children"]][["name"]][[a]]
for (b in seq_along(kegg[["children"]][["children"]][[a]][["children"]])) {
B <- kegg[["children"]][["children"]][[a]][["name"]][[b]]
for (c in seq_along(kegg[["children"]][["children"]][[a]][["children"]][[b]][["children"]])) {
pathway_info <- kegg[["children"]][["children"]][[a]][["children"]][[b]][["name"]][[c]]
pathway_id <- str_match(pathway_info, "ko[0-9]{5}")[1]
pathway_name <- str_replace(pathway_info, " \\[PATH:ko[0-9]{5}\\]", "") %>% str_replace("[0-9]{5} ", "")
pathway2name <- rbind(pathway2name, tibble(Pathway = pathway_id, Name = pathway_name))
kos_info <- kegg[["children"]][["children"]][[a]][["children"]][[b]][["children"]][[c]][["name"]]
kos <- str_match(kos_info, "K[0-9]*")[,1]
ko2pathway <- rbind(ko2pathway, tibble(Ko = kos, Pathway = rep(pathway_id, length(kos))))
}
}
}
colnames(ko2pathway) <- c("KO","Pathway")
save(pathway2name, ko2pathway, file = "kegg_info.RData")
write.table(pathway2name,"KEGG.library",sep="\t",row.names = F)
}
kegg(json = "ko00001.json")
4.4 背景注釋構(gòu)建
主要是將自己的基因集與注釋庫結(jié)合
library(clusterProfiler)
library(dplyr)
library(stringr)
options(stringsAsFactors = F)
##===============================STEP1:GO注釋生成=======================
#自己構(gòu)建的話湿痢,首先需要讀入文件
egg <- read.delim("node25.emapper.annotations",header = T,sep="\t")
egg[egg==""]<-NA #將空行變成NA涝缝,方便下面的去除
#從文件中挑出基因query與eggnog注釋信息
##gene_info <- egg %>%
## dplyr::select(GID = query, GENENAME = eggNOG_OGs) %>% na.omit()
#挑出query_name與GO注釋信息
gterms <- egg %>%
dplyr::select(query, GOs) %>% na.omit()
gene_ids <- egg$query
eggnog_lines_with_go <- egg$GOs!= ""
eggnog_lines_with_go
eggnog_annoations_go <- str_split(egg[eggnog_lines_with_go,]$GOs, ",")
gene2go <- data.frame(gene = rep(gene_ids[eggnog_lines_with_go],
times = sapply(eggnog_annoations_go, length)),
term = unlist(eggnog_annoations_go))
names(gene2go) <- c('gene_id', 'ID')
go2name <- read.delim('GO.library', header = FALSE, stringsAsFactors = FALSE)
names(go2name) <- c('ID', 'Description', 'Ontology')
go_anno <- merge(gene2go, go2name, by = 'ID', all.x = TRUE)
## 將GO注釋信息保存
save(go_anno,file = "node25_GO.rda")
##===============================STEP2:KEGG注釋生成=======================
gene2ko <- egg %>%
dplyr::select(GID = query, KO = KEGG_ko) %>%
na.omit()
pathway2name <- read.delim("KEGG.library")
colnames(pathway2name)<-c("Pathway","Name")
gene2ko$KO <- str_replace(gene2ko$KO, "ko:","")
gene2pathway <- gene2ko %>% left_join(ko2pathway, by = "KO") %>%
dplyr::select(GID, Pathway) %>%
na.omit()
kegg_anno<- merge(gene2pathway,pathway2name,by = 'Pathway', all.x = TRUE)[,c(2,1,3)]
colnames(kegg_anno) <- c('gene_id','pathway_id','pathway_description')
save(kegg_anno,file = "node25_KEGG.rda")
4.5 GO/KEEG富集分析
利用生成的node25_GO.rda、node25_KEGG.rda譬重、以及之前的node25significant.expand.genes文件進(jìn)行富集分析
我在分析的時(shí)候沒有設(shè)置p的閾值拒逮,輸出了所有的富集結(jié)果,畢竟作圖也只需要TOP10或者TOP20臀规,大家按照需求設(shè)定
##===============================STEP3:GO富集分析=================
#目標(biāo)基因列表(全部基因)
gene_select <- read.delim(file = 'node25significant.expand.genes', stringsAsFactors = FALSE,header = F)$V1
#GO 富集分析
#默認(rèn)以所有注釋到 GO 的基因?yàn)楸尘凹苍部赏ㄟ^ universe 參數(shù)輸入背景集
#默認(rèn)以 p<0.05 為標(biāo)準(zhǔn),Benjamini 方法校正 p 值塔嬉,q 值閾值 0.2
#默認(rèn)輸出 top500 富集結(jié)果
#如果想輸出所有富集結(jié)果(不考慮 p 值閾值等)玩徊,將 p、q 等值設(shè)置為 1 即可
#或者直接在 enrichResult 類對(duì)象中直接提取需要的結(jié)果
go_rich <- enricher(gene = gene_select,
TERM2GENE = go_anno[c('ID', 'gene_id')],
TERM2NAME = go_anno[c('ID', 'Description')],
pvalueCutoff = 1,
pAdjustMethod = 'BH',
qvalueCutoff = 1
)
#輸出默認(rèn)結(jié)果谨究,即根據(jù)上述 p 值等閾值篩選后的
tmp <- merge(go_rich, go2name[c('ID', 'Ontology')], by = 'ID')
tmp <- tmp[c(10, 1:9)]
tmp <- tmp[order(tmp$pvalue), ]
write.table(tmp, 'node6expand.xls', sep = '\t', row.names = FALSE, quote = FALSE)
##===============================STEP4:KEGG注釋=================
gene_select <- read.delim('node25significant.expand.genes', stringsAsFactors = FALSE,header = F)$V1
#KEGG 富集分析
#默認(rèn)以所有注釋到 KEGG 的基因?yàn)楸尘凹鞲ぃ部赏ㄟ^ universe 參數(shù)指定其中的一個(gè)子集作為背景集
#默認(rèn)以 p<0.05 為標(biāo)準(zhǔn),Benjamini 方法校正 p 值胶哲,q 值閾值 0.2
#默認(rèn)輸出 top500 富集結(jié)果
kegg_rich <- enricher(gene = gene_select,
TERM2GENE = kegg_anno[c('pathway_id', 'gene_id')],
TERM2NAME = kegg_anno[c('pathway_id', 'pathway_description')],
pvalueCutoff = 1,
pAdjustMethod = 'BH',
qvalueCutoff = 1,
maxGSSize = 500)
#輸出默認(rèn)結(jié)果畔塔,即根據(jù)上述 p 值等閾值篩選后的
write.table(kegg_rich, 'node25significant.expand.KEGG.xls', sep = '\t', row.names = FALSE, quote = FALSE)
需要注意生成的文件中有一些term可能是重復(fù)的,比如GO富集結(jié)果的16,17行,仔細(xì)看geneID那一列就會(huì)發(fā)現(xiàn)兩行的基因是完全一樣的澈吨,需要?jiǎng)h掉一個(gè)把敢。還有一點(diǎn)就是在GO/KEGG富集的結(jié)果中,有一些是動(dòng)物相關(guān)的(KEGG結(jié)果的第一行)谅辣,這些也請(qǐng)酌情考慮
5 GO/KEGG繪圖
此次使用TOP10的term進(jìn)行繪圖修赞,看一下在陸生植物出現(xiàn)的關(guān)鍵進(jìn)化節(jié)點(diǎn)主要是哪些基因功能發(fā)生了大規(guī)模擴(kuò)張
library(ggplot2)
pathway = read.delim("node25significant.expand.GO.xls",header = T,sep="\t")
pathway$GeneRatio<- pathway[,10]/54700 ##這個(gè)54700是GeneRatio的那個(gè)數(shù)值,自己修改
pathway$log<- -log10(pathway[,6])
library(dplyr)
library(ggrepel)
GO <- arrange(pathway,pathway[,6])
GO_dataset <- GO[1:10,]
#按照PValue從低到高排序[升序]
GO_dataset$Description<- factor(GO_dataset$Description,levels = rev(GO_dataset$Description))
GO_dataset$GeneRatio <- as.numeric(GO_dataset$GeneRatio)
GO_dataset$GeneRatio
GO_dataset$log
#圖片背景設(shè)定
mytheme <- theme(axis.title=element_text(face="bold", size=8,colour = 'black'), #坐標(biāo)軸標(biāo)題
axis.text.y = element_text(face="bold", size=6,colour = 'black'), #坐標(biāo)軸標(biāo)簽
axis.text.x = element_text(face ="bold",color="black",angle=0,vjust=1,size=8),
axis.line = element_line(size=0.5, colour = 'black'), #軸線
panel.background = element_rect(color='black'), #繪圖區(qū)邊框
plot.title = element_text(face="bold", size=8,colour = 'black',hjust = 0.8),
legend.key = element_blank() #關(guān)閉圖例邊框
)
#繪制GO氣泡圖
p <- ggplot(GO_dataset,aes(x=GeneRatio,y=Description,colour=log,size=Count,shape=Ontology))+
geom_point()+
scale_size(range=c(2, 8))+
scale_colour_gradient(low = "#52c2eb",high = "#EA4F30")+
theme_bw()+
labs(x='Fold Enrichment',y='GO Terms', #自定義x桑阶、y軸榔组、標(biāo)題內(nèi)容
title = 'Enriched GO Terms')+
labs(color=expression(-log[10](pvalue)))+
theme(legend.title=element_text(size=8), legend.text = element_text(size=14))+
theme(axis.title.y = element_text(margin = margin(r = 50)),axis.title.x = element_text(margin = margin(t = 20)))+
theme(axis.text.x = element_text(face ="bold",color="black",angle=0,vjust=1))
plot <- p+mytheme
plot
#保存圖片
ggsave(plot,filename = "node25GO.pdf",width = 210,height = 210,units = "mm",dpi=300)
ggsave(plot,filename = "node25GO.png",width = 210,height = 210,units = "mm",dpi=300)
KEGG同理,大家修改運(yùn)行參數(shù)和輸入文件即可
pathway = read.delim("node25significant.expand.KEGG.xls",header = T,sep="\t")
pathway$GeneRatio<- pathway[,9]/31537
pathway$log<- -log10(pathway[,5])
library(dplyr)
library(ggrepel)
GO <- arrange(pathway,pathway[,5])
GO_dataset <- GO[1:10,]
#Pathway列最好轉(zhuǎn)化成因子型联逻,否則作圖時(shí)ggplot2會(huì)將所有Pathway按字母順序重排序
#將Pathway列轉(zhuǎn)化為因子型
GO_dataset$Description<- factor(GO_dataset$Description,levels = rev(GO_dataset$Description))
GO_dataset$GeneRatio <- as.numeric(GO_dataset$GeneRatio)
GO_dataset<- arrange(GO_dataset,GO_dataset[,5])
GO_dataset$GeneRatio
GO_dataset$log
#圖片背景設(shè)定
mytheme <- theme(axis.title=element_text(face="bold", size=8,colour = 'black'), #坐標(biāo)軸標(biāo)題
axis.text.y = element_text(face="bold", size=6,colour = 'black'), #坐標(biāo)軸標(biāo)簽
axis.text.x = element_text(face ="bold",color="black",angle=0,vjust=1,size=8),
axis.line = element_line(size=0.5, colour = 'black'), #軸線
panel.background = element_rect(color='black'), #繪圖區(qū)邊框
plot.title = element_text(face="bold", size=8,colour = 'black',hjust = 0.8),
legend.key = element_blank() #關(guān)閉圖例邊框
)
#繪制KEGG氣泡圖
p <- ggplot(GO_dataset,aes(x=GeneRatio,y=Description,colour=log,size=Count))+
geom_point()+
scale_size(range=c(2, 8))+
scale_colour_gradient(low = "#52c2eb",high = "#EA4F30")+
theme_bw()+
labs(x='Fold Enrichment',y='KEEG Terms', #自定義x搓扯、y軸、標(biāo)題內(nèi)容
title = 'Enriched KEEG Terms')+
labs(color=expression(-log[10](pvalue)))+
theme(legend.title=element_text(size=8), legend.text = element_text(size=14))+
theme(axis.title.y = element_text(margin = margin(r = 50)),axis.title.x = element_text(margin = margin(t = 20)))+
theme(axis.text.x = element_text(face ="bold",color="black",angle=0,vjust=1))
plot <- p+mytheme
plot
#保存圖片
ggsave(plot,filename = "node25KEGG.pdf",width = 210,height = 210,units = "mm",dpi=300)
ggsave(plot,filename = "node25KEGG.png",width = 210,height = 210,units = "mm",dpi=300)
最后我們可以看到包归,在GO富集中锨推,最顯著的是水運(yùn)輸相關(guān)功能,KEGG富集中公壤,最顯著的是晝夜節(jié)律换可,大家覺得陸生植物的進(jìn)化和這兩個(gè)功能有沒有關(guān)系?歡迎討論
6 結(jié)果整理
結(jié)合這三篇推文的結(jié)果厦幅,我們可以整理一張主圖
總結(jié)
比較基因組學(xué)分析的基本內(nèi)容已經(jīng)介紹完了沾鳄,如果說有遺漏的內(nèi)容那應(yīng)該就是加倍化事件,由于本次使用的數(shù)據(jù)集跨度過大所以沒有增加WGD的分析确憨,大家有什么問題歡迎交流
如果覺得有幫助歡迎在博客打賞