目的:用GSVA分析單細胞轉(zhuǎn)錄組數(shù)據(jù)
基因集變異分析(GSVA)是一種非參數(shù)肖粮,無監(jiān)督的方法看杭,用于通過表達數(shù)據(jù)集的樣本估算基因集富集的差異,即基于通路上的差異分析
一、GSVA介紹與使用方法
查看以下網(wǎng)絡(luò)教程:
GSVA的使用
Day 11 充分理解GSVA和GSEA
GSVA + limma進行差異通路分析
因為沒有現(xiàn)成的小鼠gmt底層文件愁溜,需要自己構(gòu)造,這里只重點說明如何分析小鼠單細胞轉(zhuǎn)錄組的數(shù)據(jù)外厂。
二冕象、操作
下載gmt文件
gmt文件格式:h.all.v7.0.symbols.gmt
HALLMARK_TNFA_SIGNALING_VIA_NFKB http://www.gsea-msigdb.org/gsea/msigdb/cards/HALLMARK_TNFA_SIGNALING_VIA_NFKB JUNB CXCL2 ATF3 NFKBIA TNFAIP3 PTGS2 CXCL1 IER3 CD83 CCL20 CXCL3 MAFF NFKB2 TNFAIP2 HBEGF KLF6 BIRC3 PLAUR ZFP36 ICAM1 JUN EGR3 IL1B BCL2A1 PPP1R15A ZC3H12A SOD2 NR4A2 IL1A RELB TRAF1 BTG2 DUSP1 MAP3K8 ETS2 F3 SDC4 EGR1 IL6 TNF KDM6B NFKB1 LIF PTX3 FOSL1 NR4A1 JAG1 CCL4 GCH1 CCL2 RCAN1 DUSP2 EHD1 IER2 REL CFLAR RIPK2 NFKBIE NR4A3 PHLDA1 IER5 TNFSF9 GEM GADD45A CXCL10 PLK2 BHLHE40 EGR2 SOCS3 SLC2A6 PTGER4 DUSP5 SERPINB2 NFIL3 SERPINE1 TRIB1 TIPARP RELA BIRC2 CXCL6 LITAF TNFAIP6 CD44 INHBA PLAU MYC TNFRSF9 SGK1 TNIP1 NAMPT FOSL2 PNRC1 ID2 CD69 IL7R EFNA1 PHLDA2 PFKFB3 CCL5 YRDC IFNGR2 SQSTM1 BTG3 GADD45B KYNU G0S2 BTG1 MCL1 VEGFA MAP2K3 CDKN1A CCN1 TANK IFIT2 IL18 TUBB2A IRF1 FOS OLR1 RHOB AREG NINJ1 ZBTB10 PLPP3 KLF4 CXCL11 SAT1 CSF1 GPR183 PMEPA1 PTPRE TLR2 ACKR3 KLF10 MARCKS LAMB3 CEBPB TRIP10 F2RL1 KLF9 LDLR TGIF1 RNF19B DRAM1 B4GALT1 DNAJB4 CSF2 PDE4B SNN PLEK STAT5A DENND5A CCND1 DDX58 SPHK1 CD80 TNFAIP8 CCNL1 FUT4 CCRL2 SPSB1 TSC22D1 B4GALT5 SIK1 CLCF1 NFE2L2 FOSB AC129492.1 NFAT5 ATP2B1 IL12B IL6ST SLC16A6 ABCA1 HES1 BCL6 IRS2 SLC2A3 CEBPD IL23A SMAD3 TAP1 MSC IFIH1 IL15RA TNIP2 BCL3 PANX1 FJX1 EDN1 EIF1 BMP2 DUSP4 PDLIM5 ICOSLG GFPT2 KLF2 TNC SERPINB8 MXD1
HALLMARK_HYPOXIA http://www.gsea-msigdb.org/gsea/msigdb/cards/HALLMARK_HYPOXIA PGK1 PDK1 GBE1 PFKL ALDOA ENO2 PGM1 NDRG1 HK2 ALDOC GPI MXI1 SLC2A1 P4HA1 ADM P4HA2 ENO1 PFKP AK4 FAM162A PFKFB3 VEGFA BNIP3L TPI1 ERO1A KDM3A CCNG2 LDHA GYS1 GAPDH BHLHE40 ANGPTL4 JUN SERPINE1 LOX GCK PPFIA4 MAFF DDIT4 SLC2A3 IGFBP3 NFIL3 FOS RBPJ HK1 CITED2 ISG20 GALK1 WSB1 PYGM STC1 ZNF292 BTG1 PLIN2 CSRP2 VLDLR JMJD6 EXT1 F3 PDK3 ANKZF1 UGP2 ALDOB STC2 ERRFI1 ENO3 PNRC1 HMOX1 PGF GAPDHS CHST2 TMEM45A BCAN ATF3 CAV1 AMPD3 GPC3 NDST1 IRS2 SAP30 GAA SDC4 STBD1 IER3 PKLR IGFBP1 PLAUR CAVIN3 CCN5 LARGE1 NOCT S100A4 RRAGD ZFP36 EGFR EDN2 IDS CDKN1A RORA DUSP1 MIF PPP1R3C DPYSL4 KDELR3 DTNA ADORA2B HS3ST1 CAVIN1 NR3C1 KLF6 GPC4 CCN1 TNFAIP3 CA12 HEXA BGN PPP1R15A PGM2 PIM1 PRDX5 NAGK CDKN1B BRS3 TKTL1 MT1E ATP7A MT2A SDC3 TIPARP PKP1 ANXA2 PGAM2 DDIT3 PRKCA SLC37A4 CXCR4 EFNA3 CP KLF7 CCN2 CHST3 TPD52 LXN B4GALNT2 PPARGC1A BCL2 GCNT2 HAS1 KLHL24 SCARB1 SLC25A1 SDC2 CASP6 VHL FOXO3 PDGFB B3GALT6 SLC2A5 SRPX EFNA1 GLRX ACKR3 PAM TGFBI DCN SIAH2 PLAC8 FBP1 TPST2 PHKG1 MYH9 CDKN1C GRHPR PCK1 INHA HSPA5 NDST2 NEDD4L TPBG XPNPEP1 IL6 SLC6A6 MAP3K1 LDHC AKAP12 TES KIF5A LALBA COL5A1 GPC1 HDLBP ILVBL NCAN TGM2 ETS1 HOXB9 SELENBP1 FOSL2 SULT2B1 TGFB3
HALLMARK_CHOLESTEROL_HOMEOSTASIS http://www.gsea-msigdb.org/gsea/msigdb/cards/HALLMARK_CHOLESTEROL_HOMEOSTASIS FDPS CYP51A1 IDI1 FDFT1 DHCR7 SQLE HMGCS1 NSDHL LSS MVD LDLR TM7SF2 ALDOC EBP SCD PMVK MVK LPL SC5D FADS2 HMGCR HSD17B7 ANXA13 SREBF2 PCYT2 ACSS2 ATF3 ADH4 ETHE1 ECH1 CBS GUSB FASN LGALS3 ATF5 ANXA5 TP53INP1 CHKA GSTM2 ACAT2 AVPR1A PLSCR1 CLU ERRFI1 TRIB3 CXCL16 TNFRSF12A ACTG1 JAG1 LGMN FBXO6 GPX8 PNRC1 ANTXR2 MAL2 CD9 PPARG GLDC STX5 STARD4 CTNNB1 TMEM97 FAM129A PDK3 PLAUR SEMA3B GNAI1 ABCA2 ATXN2 NFIL3 ALCAM FABP5 S100A11 CPEB2
這個時候就有一個問題:gmt數(shù)據(jù)庫只有人的,如果要改為其他物種(如小鼠)(則需要自己去構(gòu)造一個這樣的底層文件)
下面以構(gòu)建小鼠的gmt文件為例:
思路:
將小鼠與人的同源基因用于替換掉文件每一行中的小鼠基因名(Symbol ID)
從Ensembl下載小鼠與人同源信息的對應(yīng)關(guān)系
目的構(gòu)建一個這樣這樣格式的gmt文件格式的文件
不過由于我們需要的是小鼠的基因名汁蝶,作為后面基因渐扮,所以需要進行如下處理:
1. 到 http://asia.ensembl.org/biomart/中下載小鼠和人的同源基因?qū)?yīng)關(guān)系
-
選擇小鼠的entrezID 與版本號的對應(yīng)關(guān)系
-
選擇人的entrezID與Symbol ID的對應(yīng)關(guān)系
- 得到的是:
mart_export.txt
映射關(guān)系:小鼠Entrez ID --> 人的Symbol ID$head mart_export.txt Gene stable ID Gene stable ID version Human gene stable ID Human gene name Human orthology confidence [0 low, 1 high] ENSMUSG00000064372 ENSMUSG00000064372.1 ENSMUSG00000064371 ENSMUSG00000064371.1 ENSMUSG00000064370 ENSMUSG00000064370.1 ENSG00000198727 MT-CYB 1
尋找小鼠EntrezID映射到小鼠Symbol ID的映射關(guān)系文件
方法有很多,這里用一個最簡單粗暴的方法:直接使用Cellranger跑出來用于矩陣文件的目錄里的features.tsv
文件找到這種映射關(guān)系
head features.tsv
ENSMUSG00000102693 4933401J01Rik Gene Expression
ENSMUSG00000064842 Gm26206 Gene Expression
ENSMUSG00000051951 Xkr4 Gene Expression
ENSMUSG00000102851 Gm18956 Gene Expression
編寫Python腳本
寫腳本將小鼠的gene symbol ID替換掉原本gmt文件中的人對應(yīng)的同源基因 symbol ID(感覺用perl或awk寫會比較簡潔掖棉,當然Python也可以)
- 腳本思路:
- 從
mart_export.txt
中尋找映射關(guān)系:用哈希1存放映射關(guān)系(人gene symbol ID-->小鼠Entrez ID) - 從
features.tsv
中尋找映射關(guān)系:用哈希2存放映射關(guān)系(小鼠Entrez ID-->小鼠symbol ID) - 循環(huán)遍歷gmt文件
h.all.v7.0.symbols.gmt
,將每一行里面的人類gene symbol ID作為鍵去查找相應(yīng)的值(小鼠的Entrez ID):- 若查找不到返回空值;
- 若查找到,則繼續(xù)用哈希1的獲取到值(小鼠的Entrez ID),作為哈希2的鍵獲取相應(yīng)的小鼠Gene symbol ID,替換原本的人類基因symbol ID
- 從
#!/usr/bin/python
#Author:Robin 20200220
#For transfer human to mouse
human_set = set()
h2m = dict()
m2m = dict()
# human gene ID to mouse Entrez ID
with open(./mart_export.txt') as f1:
for line in f1:
line = line.strip('\n')
lst = line.split('\t')
if lst[3]:
if lst[3] in human_set:
h2m[lst[3]].append(lst[0])
else:
h2m[lst[3]] = [ lst[0] ]
human_set.add(lst[3])
# mouse Entrez ID to mouse symbol ID
with open('./features.tsv') as f2:
for line in f2:
line = line.strip('\n')
lst = line.split('\t')
m2m[lst[0]] = lst[1]
# 遍歷gmt文件每行墓律,進行替換
with open('./h.all.v7.0.symbols.gmt') as f3:
with open('./mm.all.v7.0.symbols.gmt', 'w') as f4:
for line in f3:
line = line.strip('\n')
lst = line.split('\t')
name = lst[0]
url = lst[1]
gene_list = lst[2:]
# replace the human gene symbol into mouse gene symbol
tmp_list = []
for hg in gene_list: #each gene in row
mouse_entrez = h2m.get(hg, "")
tmp = []
for entrez in mouse_entrez: #each element in entrez IDs
mouse_gene = m2m[entrez]
tmp.append(mouse_gene)
tmp_list.extend(tmp)
row = [name, url]
row.extend(tmp_list)
row = '\t'.join(row)
row = row + '\n'
print(row)
f4.write(row)
用GSVA將表達矩陣轉(zhuǎn)換成通路矩陣,并生成熱圖
.libPaths(c('/share/nas2/public/R/library/3.6','/home/honghh/R/x86_64-redhat-linux-gnu-library/3.6'))
library(GSVA)
library(GSEABase)
library(limma)
library("org.Hs.eg.db")
library(parallel)
library(Seurat)
options(future.globals.maxSize=9891289600)
wd='/share/nas1/Data/Users/luohb/.../20200220'
dir.create(wd)
setwd(wd)
RSA<-readRDS('/share/nas1/Data/Users/yinl/Project/personality/…/merge_seurat_rna.rds')
Idents(RSA)<-RSA@meta.data$seurat_clusters
c11_seu<-subset(RSA, idents = 11)
sam_list<-c(c11_seu)
name_list<-c('c11_seu')
i<-1
for(sam in sam_list){
name<-name_list[i]
print(name)
#remove Mitochondrial gene
tmp<-as.matrix(sam@assays$RNA@counts)
bool<-!grepl('^mt-',rownames(sam),perl = T)
tmp<-tmp[bool,]
print(class(tmp))
rownames(tmp)<-toupper(rownames(tmp))
count.marix<-tmp
keggSet <- getGmt("/share/nas1/Data/Users/luohb/20191223/mm.all.v7.0.symbols.gmt")
keggEs <- gsva(expr=as.matrix(count.marix), gset.idx.list=keggSet, kcdf="Poisson", parallel.sz=30)
saveRDS(keggEs, file=paste0(name, '_keggEs_filter.rds'))
keggEs_filter<-readRDS(paste0(name, '_keggEs_filter.rds'))
#add annotation
group<-gsub('[1-9]_.*', '',colnames(keggEs_filter))
col_annotation=data.frame(Type=group)
rownames(col_annotation) = colnames(keggEs_filter)
#plot heatmap
p<-pheatmap::pheatmap(keggEs_filter, show_colnames=F, annotation_col=col_annotation)
filename<-paste0(name, '_keggEs_filter.heatmap.pdf')
ggsave(p, filename=filename,
width = 40, height = 20)
i<-i+1
}