針對(duì)某個(gè)基因敲除的斑馬魚反镇,在12-25天會(huì)逐漸死亡奋蔚,因此我們選擇了8dpf 和16dpf 的由于進(jìn)行轉(zhuǎn)錄組測(cè)序,根據(jù)差異基因我們繪制了如下的韋恩圖联贩,然后我們需要重點(diǎn)關(guān)注灰色區(qū)域的546個(gè)差異基因的部分漫仆,GO分析和KEGG分析
一、術(shù)語(Terminology)
(1)基因集(gene set)和通路(pathway):基因集是功能相關(guān)的基因的無序集合泪幌。忽略基因間的功能關(guān)系盲厌,可以將通路解釋為一個(gè)基因集。
(2)Gene Ontology(GO):GO定義了描述基因功能的概念/類(concepts/classes)祸泪。它將功能分為三個(gè)方面:
MF: Molecular Function - molecular activities of gene products
CC: Cellular Component - where gene products are active
BP: Biological Process - pathways and larger processes made up of the activities of multiple gene products
GO terms組織在一個(gè)有向無環(huán)圖(directed acyclic graph)中吗浩,terms之間的邊表示父子關(guān)系(parent-child relationship)。
(3)Kyoto Encyclopedia of Genes and Genomes(KEGG):KEGG是手工繪制的代表分子相互作用和反應(yīng)網(wǎng)絡(luò)的路徑圖(pathway maps)的集合没隘。這些途徑涵蓋了廣泛的生化過程懂扼,可分為7大類:新陳代謝、遺傳和環(huán)境信息處理右蒲、細(xì)胞過程阀湿、機(jī)體系統(tǒng)、人類疾病和藥物開發(fā)(metabolism, genetic and environmental information processing, cellular processes, organismal systems, human diseases, and drug development)瑰妄。
鏈接:http://www.reibang.com/p/d484003dced5
1.利用python找出這546個(gè)基因陷嘴,并寫入csv文件中。(因?yàn)镽學(xué)的不好间坐,不知道如何獲取546個(gè)基因)
from xlrd import open_workbook
import xlrd
import requests
import re
import json
import time
import os
import xlwt
import xlwt
def together_and_apart_from(A,B,C,D,a,b,c,d): #####找到C灾挨,D集合之間的交集,并排除A,B的集合, 還需要去掉重復(fù)部分。
AandB = []
for i in C:
if i in D:
if i not in A and i not in B:
if i not in AandB:
AandB.append(i)
print(len(AandB))
return AandB
######
4個(gè)差異基因文件的路徑
path_excel_8_Hv8_C = r'I:\trit1\T202111_8dpf_and_16dpf\revised 更換了個(gè)野生型16d\Summary\coverage_gene\T8_HVST8_C_Gene_differential_expression.xls'
path_excel_16_Cv8_C = r'I:\trit1\T202111_8dpf_and_16dpf\revised 更換了個(gè)野生型16d\Summary\coverage_gene\T16_CVST8_C_Gene_differential_expression.xls'
path_excel_16_Hv8_H = r'I:\trit1\T202111_8dpf_and_16dpf\revised 更換了個(gè)野生型16d\Summary\coverage_gene\T16_HVST8_H_Gene_differential_expression.xls'
path_excel_16_Hv16_C = r'I:\trit1\T202111_8dpf_and_16dpf\revised 更換了個(gè)野生型16d\Summary\coverage_gene\T16_HVST16_C_Gene_differential_expression.xls'
讀取表格竹宋,選中第一列g(shù)ene_name
workbook_8_Hv8_C = xlrd.open_workbook(path_excel_8_Hv8_C )
sheet1 = workbook_8_Hv8_C.sheet_by_index(0)
col_gene_name_8_Hv8_C = sheet1.col_values(1)
workbook_16_Cv8_C= xlrd.open_workbook(path_excel_16_Cv8_C)
sheet1 = workbook_16_Cv8_C.sheet_by_index(0)
col_gene_name_16_Cv8_C = sheet1.col_values(1)
workbook_16_Hv8_H = xlrd.open_workbook(path_excel_16_Hv8_H )
sheet1 = workbook_16_Hv8_H.sheet_by_index(0)
col_gene_name_16_Hv8_H = sheet1.col_values(1)
workbook_16_Hv16_C= xlrd.open_workbook(path_excel_16_Hv16_C)
sheet1 = workbook_16_Hv16_C.sheet_by_index(0)
col_gene_name_16_Hv16_C = sheet1.col_values(1)
print(len(col_gene_name_8_Hv8_C ),len(col_gene_name_16_Cv8_C),len(col_gene_name_16_Hv8_H),len(col_gene_name_16_Hv16_C))
A=col_gene_name_8_Hv8_C
B=col_gene_name_16_Cv8_C
C=col_gene_name_16_Hv8_H
D=col_gene_name_16_Hv16_C
sets=together_and_apart_from(A, B, C, D, a, b, c, d)
print(sets))
寫入表格
style = xlwt.XFStyle()
font = xlwt.Font()
font.name = 'SimSun'
style.font = font
w = xlwt.Workbook(encoding='utf-8')
# 添加個(gè)sheet
ws = w.add_sheet('sheet 1', cell_overwrite_ok=True)
i=0
for x in sets:
ws.write(i, 0, x)
i=i+1
w.save('四維韋恩圖挑選的交集.xls')
打開保存的xls文件劳澄,在第一行插入加入gene_name
轉(zhuǎn)到R
setwd("I:/trit1/T202111_8dpf_and_16dpf/revised 更換了個(gè)野生型16d/Summary/coverage_gene")
T16<-read.csv("T16_HVST16_C_Gene_differential_expression.csv") ##1360行,蜈七,22列
four_lane<-read.csv("四維韋恩圖挑選的交集.csv") ##546行
four_lane
T=merge(T16,four_lane,by = 'gene_name')
## 554 entries, 22 total columns
T1=merge(T16,four_lane,by = 'gene_name',all=TRUE)
## 1,360 entries, 22 total columns
T2=merge(T16,four_lane,by = 'gene_name',all=FALSE)
##554 entries, 22 total columns
所以T和T2得到的交集是所需要的表
T2=merge(T16,T8,by = c('EntrezID','transcript_id'))##兩個(gè)矩陣以某多列進(jìn)行合并秒拔,共有的序列
2. GO分析
安裝需要的包
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("clusterProfiler")
BiocManager::install("topGO")
BiocManager::install("pathview")
BiocManager::install("org.Dr.eg.db")
BiocManager::install("ggnewscale")
library(ggnewscale)
library(clusterProfiler) #用來做富集分析
library(topGO)#畫GO圖用的
library(pathview) #看KEGG pathway的
library(enrichplot)
library(org.Hs.eg.db)#這個(gè)包里存有人的注釋文件
library(org.Dr.eg.db)#這個(gè)包里存有斑馬魚的注釋文件
library(ggplot2)
library(DOSE)
library(GO.db)
在Rstiduo中安裝自己的orgDb包
OrgDB注釋包
載入R包并大致查看一下R包的內(nèi)容
library(org.Dr.eg.db)#這個(gè)包里存有斑馬魚的注釋文件
columns(org.Dr.eg.db)
columns(org.Dr.eg.db)
columns(org.Hs.eg.db)
可用keys()命令查看一下這個(gè)包大致有哪些genes。
可用select()命令查一下某個(gè)基因?qū)?yīng)的GO和Pathway信息
select(org.Dr.eg.db,keys = '30068',columns=c('GO','PATH'))
做GO分析是不能直接用基因名的宪潮,必須得先轉(zhuǎn)化成entre id, 公司的測(cè)序數(shù)據(jù)有ENtrezID 列的數(shù)據(jù)溯警,但是后來發(fā)現(xiàn)很多基因名沒有對(duì)應(yīng)的entre id趣苏,所以利用返回來數(shù)據(jù)中的gene_id 重新生成entre id。
獲取最新entrez ID梯轻,利用gene_id進(jìn)行轉(zhuǎn)換
ensembl_gene_id=T$gene_id
id <-bitr(ensembl_gene_id, fromType = "ENSEMBL",
toType = c("ENTREZID"),
OrgDb = org.Dr.eg.db,drop = FALSE )
ENTREZ_ID = id$ENTREZID
獲取entrez ID
gene_name=T[,1] #基因名
log2foldchange=T[,17] #log2foldchange列
ENTREZ_ID = id$ENTREZID #EntrezID列
T的數(shù)據(jù)格式
readable=TRUE代表將基因ID轉(zhuǎn)換為gene symbol
BP層面上的富集分析:
go_bp<-enrichGO(gene =ENTREZ_ID ,OrgDb = org.Dr.eg.db, keyType='ENTREZID', ont = "BP", pAdjustMethod = "BH",pvalueCutoff = 0.05, qvalueCutoff = 0.05, readable=TRUE)
write.csv(go_bp@result,"go_bp.csv")
dim(go_bp)# 148 9
CC層面上的富集分析:
go_cc<-enrichGO(gene = ENTREZ_ID,OrgDb = org.Dr.eg.db,keyType = 'ENTREZID', ont = "CC", pAdjustMethod = "BH",pvalueCutoff = 0.05, qvalueCutoff = 0.05)
把結(jié)果導(dǎo)出保存
write.csv(go_bp@result,"go_bp.csv")
dim(go_cc)##33 9
go_MF <- enrichGO(gene =ENTREZ_ID, OrgDb= org.Dr.eg.db, keyType = 'ENTREZID', ont = "MF",pAdjustMethod = "BH",pvalueCutoff = 0.05,qvalueCutoff = 0.05)
write.csv(go_MF@result,"go_mf.csv")
dim(go_MF)##6 9
all包括三個(gè)組分的分析
go_all<-enrichGO(gene = ENTREZ_ID,OrgDb = org.Dr.eg.db,keyType = 'ENTREZID', ont = "ALL", pAdjustMethod = "BH",pvalueCutoff = 0.05, qvalueCutoff = 0.05,readable = TRUE)
write.csv(go_all@result,"go_all.csv")
dim(go_all)# 187 10
sum(go_all$ONTOLOGY=="BP") #148
sum(go_all$ONTOLOGY=="CC") ##33
sum(go_all$ONTOLOGY=="MF")##6
go_cc 保存結(jié)果
reduce redundancy of enriched GO terms
GO以parent-child結(jié)構(gòu)組織食磕,因此父術(shù)語與所有子術(shù)語有很大比例的重疊。這可能導(dǎo)致冗余的結(jié)果喳挑。為了解決這一問題彬伦,clusterProfiler實(shí)現(xiàn)了簡(jiǎn)化方法simplify,以減少enrichGO 和gseGO產(chǎn)生的冗余的GO術(shù)語伊诵。函數(shù)內(nèi)部稱為GOSemSim (Yu et al. 2010)单绑,用于計(jì)算GO項(xiàng)之間的語義相似度,并通過保留一個(gè)代表性項(xiàng)來去除高度相似的項(xiàng)曹宴。
egosimp <- simplify(go_bp,cutoff=0.7,by="p.adjust",select_fun=min,measure="Wang")
write.csv(egosimp@result,"gosimp.csv")
head(egosimp);
dim(egosimp)# 57 9
方法1:基于它們的共有父條目的注釋統(tǒng)計(jì),計(jì)算語義相似性得分
包含Resnik搂橙、Lin、Jiang 和Schlicker四種方法
方法2:基于GO圖形結(jié)構(gòu),Wang
進(jìn)行GO terms集的相似性分析時(shí)一般采取基于Resnik和Lin兩種方法的綜合方法,簡(jiǎn)稱為simRel方法
鏈接:http://www.reibang.com/p/d484003dced5
可視化
做成氣泡圖形式
dotplot(object,x="GeneRatio",color="p.adjust",showCategory=10,
size=NULL,split=NULL,font.size=12,title="",...)
橫軸為GeneRatio笛坦,代表該GO term下富集到的基因個(gè)數(shù)占列表基因總數(shù)的比例
縱軸為富集到的GO Terms的描述信息区转,showCategory指定展示的GO Terms的個(gè)數(shù)
dotplot(egoall,showCategory=10)
dotplot(go_bp,show,Category=20)
dotplot(go_cc)
dotplot(go_MF)
做成條形圖形式
barplot(go_bp, showCategory = 12,title="The GO_BP enrichment analysis of all DEGs ", drop=TRUE)
三個(gè)圖匯聚在一塊
dotplot(go_all,title='Top5 GO terms of each sub-class',showCategory=5,split='ONTOLOGY')+facet_grid(ONTOLOGY~.,scale="free")
前五條通路的共同基因的circle圖
對(duì)于基因和富集的GO terms之間的對(duì)應(yīng)關(guān)系進(jìn)行展示圖中灰色的點(diǎn)代表基因,黃色的點(diǎn)代表富集到的GO terms,如果一個(gè)基因位于一個(gè)GO Terms下版扩,則將該基因與GO連線,黃色節(jié)點(diǎn)的大小對(duì)應(yīng)富集到的基因個(gè)數(shù)废离,默認(rèn)畫top5富集到的GO terms
geneList=log2foldchange
names(geneList)=gene_name
geneList=sort(geneList,decreasing = T)
#(3)展示top5通路的共同基因,要放大看礁芦。
#Gene-Concept Network
對(duì)于基因和富集的GO terms之間的對(duì)應(yīng)關(guān)系進(jìn)行展示蜻韭,如果一個(gè)基因位于一個(gè)GO Terms下,則將該基因與GO連線柿扣,用法如下
cnetplot(go_bp, showCategory=5)
cnetplot(go_bp, categorySize="pvalue", foldChange=geneList,colorEdge = TRUE)
#圓形布局肖方,給線條上色
cnetplot(go_bp, foldChange=geneList, circular = TRUE, colorEdge = TRUE)
圖中灰色的點(diǎn)代表基因,黃色的點(diǎn)代表富集到的GO terms, 默認(rèn)畫top5富集到的GO terms, GO 節(jié)點(diǎn)的大小對(duì)應(yīng)富集到的基因個(gè)數(shù)窄刘。更多用法和細(xì)節(jié)請(qǐng)參考官方文檔窥妇。
通路之間的關(guān)系
goplot(go_bp)
Heatmap-like functional classification
基因展示的heatmap
heatplot(go_bp, foldChange = geneList)
太多基因就會(huì)糊舷胜∶浼可通過調(diào)整比例或者減少基因來控制。
pdf("heatplot.pdf",width = 14,height = 5)
heatplot(ego_bp,foldChange = geneList)
dev.off()
readable=TRUE 代表將基因ID轉(zhuǎn)換為gene symbol
go_bp<-enrichGO(gene =ENTREZ_ID ,OrgDb = org.Dr.eg.db, keyType='ENTREZID', ont = "BP", pAdjustMethod = "BH",pvalueCutoff = 0.05, qvalueCutoff = 0.05, readable=TRUE)
cnetplot(go_bp, categorySize="pvalue", foldChange=geneList,colorEdge = TRUE)
加上readable=TRUE 就可以顯示基因的名字
GO有向無環(huán)圖
矩形代表富集到的top10個(gè)GO terms, 顏色從黃色過濾到紅色烹骨,對(duì)應(yīng)p值從大到小翻伺。
plotGOgraph(go_bp)
emapplot(go_bp,showCategory = 30)
對(duì)于富集到的GO terms之間的基因重疊關(guān)系進(jìn)行展示,每個(gè)節(jié)點(diǎn)是一個(gè)富集到的GO term,默認(rèn)畫top30個(gè)富集到的GO terms沮焕。節(jié)點(diǎn)大小對(duì)應(yīng)該GO terms下富集到的基因個(gè)數(shù)吨岭,節(jié)點(diǎn)的顏色對(duì)應(yīng)p.adjust的值,紅色小藍(lán)色大峦树,如果兩個(gè)GO terms的差異基因存在重疊辣辫,說明這兩個(gè)節(jié)點(diǎn)存在overlap關(guān)系旦事,用線條連接起來
emapplot(egoMF,showCategory=10)
對(duì)于富集到的GO terms之間的基因重疊關(guān)系進(jìn)行展示,如果兩個(gè)GO terms系的差異基因存在重疊急灭,說明這兩個(gè)節(jié)點(diǎn)存在overlap關(guān)系姐浮,在圖中用線條連接起來,用法如下
emapplot(go_bp,showCategory = 30)會(huì)報(bào)錯(cuò)葬馋,需要經(jīng)過以下代碼運(yùn)行下
x2<-pairwise_termsim(go_bp)
emapplot(x2,showCategory = 30)
每個(gè)節(jié)點(diǎn)是一個(gè)富集到的GO term, 默認(rèn)畫top30個(gè)富集到的GO terms, 節(jié)點(diǎn)大小對(duì)應(yīng)該GO terms下富集到的差異基因個(gè)數(shù)卖鲤,節(jié)點(diǎn)的顏色對(duì)應(yīng)p.adjust的值,從小到大畴嘶,對(duì)應(yīng)藍(lán)色到紅色蛋逾。
emapplot(x2,showCategory = 10)
upsetplot(go_bp)
upset展現(xiàn)多個(gè)數(shù)據(jù)集合的交互關(guān)系的圖形,類似于傳統(tǒng)的venn圖
山脊線圖 ridgeline plot for expression distribution of GSEA result
KEGG分析
KEGG 分析
search_kegg_organism("human", by="common_name")
search_kegg_organism("zebrafish", by="common_name")
gene_kegg<-enrichKEGG(gene =ENTREZ_ID ,organism = 'dre',keyType='kegg', pAdjustMethod = "BH",pvalueCutoff = 0.05, qvalueCutoff = 0.05,use_internal_data = FALSE)
dim(gene_kegg)
dotplot(gene_kegg)
barplot(gene_kegg)
enrichMap(gene_kegg)
cnetplot(gene_kegg, showCategory=5)
#將ENTREZID轉(zhuǎn)化為可讀的gene symbol
eKEGG <- setReadable(gene_kegg, OrgDb = org.Dr.eg.db, keyType="ENTREZID")
cnetplot(eKEGG, showCategory=5)
轉(zhuǎn)換位基因名字后
gene.kegg <- bitr_kegg(ENTREZ_ID,fromType="ncbi-geneid",toType="kegg",organism='dre')
head(gene.kegg)
ekegg <- enrichKEGG(ENTREZ_ID, organism='dre',keyType="ncbi-geneid",pvalueCutoff=0.05,pAdjustMethod='BH',qvalueCutoff=0.2, minGSSize=10,maxGSSize=500,use_internal_data=F)
write.csv(ekegg@result,"ekegg.csv")
head(ekegg,1);dim(ekegg)