單細(xì)胞分析不可避免的會(huì)涉及到鑒定不同細(xì)胞類型的功能或比較相同類型在不同狀態(tài)下的功能差異,因此员辩,功能分析在單細(xì)胞分析中相當(dāng)重要.
功能分析的方法很多,這兒我們主要講述3種功能分析方法鸵鸥,一是基于篩選的差異基因奠滑,采用超幾何檢驗(yàn)判斷上調(diào)或下調(diào)基因在哪些GO或KEGG或其它定義的通路富集,代表clusterProfiler包脂男。另一種是不篩選差異基因养叛,而是對(duì)其根據(jù)表達(dá)量或與表型的相關(guān)度排序,然后判斷對(duì)應(yīng)的基因集是否傾向于落在有序列表的頂部或底部宰翅,從而判斷基因集合對(duì)表型差異的影響和篩選有影響的基因子集弃甥,代表GSEA富集分析。GSVA是一種非參數(shù)汁讼,無監(jiān)督的算法,與GSEA不同淆攻,GSVA不需要預(yù)先對(duì)樣本進(jìn)行分組阔墩,可以計(jì)算每個(gè)樣本中特定基因集的富集分?jǐn)?shù)。
下載測試數(shù)據(jù)集:wget https://cf.10xgenomics.com/samples/cell/pbmc3k/pbmc3k_filtered_gene_bc_matrices.tar.gz
這兒我們使用的是pbmc 3k數(shù)據(jù)作為測試瓶珊,其它單細(xì)胞數(shù)據(jù)分析類似啸箫。
解壓:tar -zxvf pbmc3k_filtered_gene_bc_matrices.tar.gz
單細(xì)胞轉(zhuǎn)錄數(shù)據(jù)分析之Scanpy:http://www.reibang.com/p/e22a947e6c60
單細(xì)胞轉(zhuǎn)錄組之Scanpy - 軌跡推斷/擬時(shí)序分析:http://www.reibang.com/p/0b2ca0e0b544
單細(xì)胞空間轉(zhuǎn)錄分析之Seurat:http://www.reibang.com/p/c9a601ced91f
單細(xì)胞空間轉(zhuǎn)錄分析之Seurat-多樣本整合(淺談空間批次):http://www.reibang.com/p/609b04096b79
單細(xì)胞輔助注釋工具-SingleR :http://www.reibang.com/p/1c4abf05cb3e
加載Seurat相關(guān)包:
library(dplyr)
library(Seurat)
library(patchwork)
運(yùn)行Seurat基本流程,獲得細(xì)胞簇:
setwd("/home/wucheng/jianshu/function/data")
pbmc.data <- Read10X(data.dir = "/home/wucheng/jianshu/function/data/filtered_gene_bc_matrices/hg19/")
pbmc <- CreateSeuratObject(counts = pbmc.data, project = "pbmc3k", min.cells = 3, min.features = 200)
pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-")
pbmc <- subset(pbmc, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 5)
pbmc <- NormalizeData(pbmc, normalization.method = "LogNormalize", scale.factor = 10000)
pbmc <- FindVariableFeatures(pbmc, selection.method = "vst", nfeatures = 2000)
all.genes <- rownames(pbmc)
pbmc <- ScaleData(pbmc, features = all.genes)
pbmc <- RunPCA(pbmc, features = VariableFeatures(object = pbmc))
pbmc <- FindNeighbors(pbmc, dims = 1:10)
pbmc <- FindClusters(pbmc, resolution = 0.5)
pbmc <- RunUMAP(pbmc, dims = 1:10)
pdf("Umap.pdf")
DimPlot(pbmc, reduction = "umap")
dev.off()
saveRDS(pbmc,"pbmc.rds")
pbmc.markers <- FindAllMarkers(pbmc, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25) #識(shí)別每一簇的marker genes
一伞芹,使用clusterProfiler包忘苛,GO和KEGG富集分析
安裝clusterProfiler, org.Hs.eg.db包:BiocManager::install('clusterProfiler')
BiocManager::install("org.Hs.eg.db")
加載包
library(ggplot2)
library(clusterProfiler)
library(org.Mm.eg.db) ##加載小鼠
library(org.Hs.eg.db) ##加載人類
具體分析
library(Seurat)
setwd("/home/wucheng/jianshu/function/data")
pbmc <-readRDS("pbmc.rds")
cluster5.markers <- FindMarkers(pbmc, ident.1 = 5, ident.2 = c(0, 3), min.pct = 0.25) ##這兒選取C5和C0唱较,C3簇比較扎唾,做差異分析,識(shí)別差異表達(dá)基因南缓,可根據(jù)實(shí)際具體選擇
head(cluster5.markers, n = 5)
> head(cluster5.markers, n = 5)
p_val avg_logFC pct.1 pct.2 p_val_adj
FCGR3A 7.583625e-209 2.963144 0.975 0.037 1.040018e-204
IFITM3 2.500844e-199 2.698187 0.975 0.046 3.429657e-195
CFD 1.763722e-195 2.362381 0.938 0.037 2.418768e-191
CD68 4.612171e-192 2.087366 0.926 0.036 6.325132e-188
RP11-290F20.3 1.846215e-188 1.886288 0.840 0.016 2.531900e-184
up <-rownames(cluster5.markers[intersect(which(cluster5.markers [,1]<0.05),which(cluster5.markers [,2]>=0.25)),])
down <-rownames(cluster5.markers[intersect(which(cluster5.markers [,1]<0.05),which(cluster5.markers [,2]<=(-0.25))),])
##識(shí)別up 胸遇,down genes
gs = bitr(up, fromType="SYMBOL", toType="ENTREZID", OrgDb="org.Hs.eg.db")
head(gs)
ego.bp = enrichGO(gene=gs$ENTREZID, OrgDb = org.Hs.eg.db,ont= "BP",pAdjustMethod = "BH",pvalueCutoff= 0.05,qvalueCutoff= 0.2,readable= TRUE)
head(ego.bp)
write.csv(ego.bp, file ="Cluster5_up_go.csv") ##輸出up 基因富集的Go term
pdf("Cluster5_up_go.pdf", height =5, width = 10)
print(dotplot(ego.bp, showCategory=10,title="Cluster5 Vs.Cluster03 up gene GoTerm"))
dev.off()
kk <- enrichKEGG(gene= gs$ENTREZID, organism = 'hsa',pvalueCutoff = 0.05)
write.csv(kk,file ="Cluster5_up_kegg.csv") ##輸出up 基因富集的Kegg term
pdf("Cluster5_up_kegg.pdf", height =5, width = 10)
print(dotplot(kk, showCategory=10,title="KEGG_Up_biological")) #展示前十個(gè)條目
dev.off()
gs = bitr(down, fromType="SYMBOL", toType="ENTREZID", OrgDb="org.Hs.eg.db")
head(gs)
ego.bp = enrichGO(gene=gs$ENTREZID, OrgDb = org.Hs.eg.db,ont= "BP",pAdjustMethod = "BH",pvalueCutoff= 0.05,qvalueCutoff= 0.2,readable= TRUE)
head(ego.bp)
write.csv(ego.bp, file ="Cluster5_down_go.csv") ##輸出down 基因富集的Go term
pdf("Cluster5_down_go.pdf", height =5, width = 10)
print(dotplot(ego.bp, showCategory=10,title="Cluster5Vs.Cluster03 down gene GoTerm"))
dev.off()
kk <- enrichKEGG(gene= gs$ENTREZID, organism = 'hsa',pvalueCutoff = 0.05)
write.csv(kk,file ="Cluster5_down_kegg.csv") ##輸出down 基因富集的Kegg term
pdf("Cluster5_down_kegg.pdf", height =5, width = 10)
print(dotplot(kk, showCategory=10,title="KEGG_biological"))
dev.off()
二,GSEA分析
使用R語言來進(jìn)行GSEA分析可以有兩種方法汉形,一個(gè)是fgsea包纸镊,另一個(gè)是使用clusterProfiler包。http://www.gsea-msigdb.org/gsea/downloads.jsp
MSigdb(Molecular Signatures Database)數(shù)據(jù)庫包含了以下9種不同基因的基因概疆,可供下載以及R軟件包載入逗威。
- H:hallmark gene sets---癌癥特征基因集合,共50 gene sets届案。
- C1:positional gene sets---染色體位置基因集合庵楷,共299 gene sets。
- C2:curated gene sets---包含了已知數(shù)據(jù)庫楣颠,文獻(xiàn)等基因集信息尽纽,包含5529 gene sets。
- C3:motif gene sets---調(diào)控靶基因集合童漩,包括miRNA-target以及轉(zhuǎn)錄因子-target調(diào)控模式弄贿,3735 gene sets。
- C4:computational gene sets---計(jì)算機(jī)軟件預(yù)測出來的基因集合矫膨,主要是和癌癥相關(guān)的基因差凹,858 gene sets。
- C5:GO gene sets---Gene Ontology對(duì)應(yīng)的基因集合侧馅,10192 gene sets危尿。
- C6:oncogenic signatures---致癌基因集合,大部分來源于NCBI GEO發(fā)表芯片數(shù)據(jù)馁痴,189 gene sets谊娇。
- C7:immunologic signatures---免疫相關(guān)基因集,4872 gene sets罗晕。
- C8:single cell identitiy gene sets, 302 gene sets
fgsea包分析
導(dǎo)入包:
library(msigdbr) #提供MSigdb數(shù)據(jù)庫基因集
library(fgsea)
library(dplyr)
library(tibble)
library(Seurat)
分析代碼:
markers <- FindMarkers(pbmc, ident.1 = 5, ident.2 = c(0, 3), min.pct = 0.25, logfc.threshold = 0) ##同樣選取C5和C0济欢,C3簇比較赠堵,做差異分析,這兒輸出所有基因法褥,前端可理解為C5上調(diào)茫叭,后端為C5下調(diào)基因
markers$genes = rownames(markers)
cluster.genes<- markers %>% arrange(desc(avg_logFC)) %>% dplyr::select(genes,avg_logFC) #基因按logFC排序
ranks<- deframe(cluster.genes)
mdb_c2 <- msigdbr(species = "Homo sapiens", category = "C2")## 定義基因集,選取C2
fgsea_sets = mdb_c2 [grep("^KEGG",mdb_c2 $gs_name),] %>% split(x = .$gene_symbol, f = .$gs_name)
length(fgsea_sets)
fgseaRes<- fgsea(fgsea_sets, stats = ranks, nperm = 1000) #運(yùn)行fgsea
p <-ggplot(fgseaRes %>% as_tibble() %>% arrange(desc(NES)) %>% filter(pval < 0.05) %>% head(n= 20), aes(reorder(pathway, NES), NES)) +
geom_col(aes(fill= NES)) +
coord_flip() +
labs(x="KEGG", y="Normalized Enrichment Score",title="KEGG gene sets NES from GSEA") ##輸出差異排秩前20的條目
pdf('GSEA-fgsea.pdf',width=8,height=5)
print(p)
dev.off()
pdf('fgsea_KEGG_PRIMARY_IMMUNODEFICIENCY.pdf',width=8,height=5)
plotEnrichment(fgsea_sets[["KEGG_PRIMARY_IMMUNODEFICIENCY"]],ranks) + labs(title="KEGG_PRIMARY_IMMUNODEFICIENCY") #對(duì)某一特定通路分析
dev.off()
X軸是所有基因的排序半等,這兒大約1200個(gè)揍愁。每個(gè)黑條是該基因集中的基因,這樣我們可以知道基因在排序列表中的位置酱鸭。如果基因集位于預(yù)先排列的基因列表的頂部吗垮,則通過某種度量計(jì)算出富集分?jǐn)?shù)(enrichment score垛吗,ES)凹髓,ES為正。如果基因集位于預(yù)先排列的基因列表的底部怯屉,則ES為負(fù)蔚舀。從以上圖中可以看出多數(shù)基因都落在了峰值之后(綠線峰值)的核心基因集中,表明基因在該數(shù)據(jù)集中被抑制锨络,也就是低表達(dá)赌躺。
clusterProfiler包分析
geneset <- read.gmt("c2.cp.kegg.v7.3.entrez.gmt")
geneset$ont = str_remove(geneset$ont,"HALLMARK_")
markers <- FindMarkers(pbmc, ident.1 = 5, ident.2 = c(0, 3), min.pct = 0.1, logfc.threshold = 0)
gs <-bitr(rownames(markers), fromType="SYMBOL", toType="ENTREZID", OrgDb="org.Hs.eg.db")
markers1<-cbind(markers[gs[,1],],gs)
geneList = markers1$avg_logFC
names(geneList) = markers1$ENTREZID
geneList = sort(geneList,decreasing = T)
egmt <- GSEA(geneList, TERM2GENE=geneset,verbose=F)
egmt1<- setReadable(egmt,OrgDb=org.Hs.eg.db, keyType = "ENTREZID")
y=data.frame(egmt2)
pdf("dotplot.pdf")#氣泡圖,展示geneset被激活還是抑制
dotplot(egmt1,split=".sign")+facet_grid(~.sign)
dev.off()
library(enrichplot)
for(i in seq_along(egmt@result$ID)){
p <- gseaplot2(egmt, geneSetID = i, title = egmt@result$ID[i])
filename <- paste0('GSEA', egmt@result$ID[i], '.png')
ggsave(filename = filename, p, width = 8, height = 5)
}
三羡儿, GSVA分析
GSVA要求輸入表達(dá)矩陣和基因集列表礼患。表達(dá)矩陣從seurat對(duì)象導(dǎo)入即可
導(dǎo)入相關(guān)包
library(ggplot2)
library(dplyr)
library(msigdbr)
library(Seurat)
library(GSVA)
library(pheatmap)
library(patchwork)
具體代碼
setwd("/home/wucheng/jianshu/function/data")
pbmc <-readRDS("pbmc.rds")
expr <- as.data.frame(pbmc@assays$RNA@data) #表達(dá)矩陣
meta <- pbmc@meta.data[,c("orig.ident","seurat_clusters")] #類別
m_df = msigdbr(species = "Homo sapiens", category = "C2", subcategory = "CP:KEGG") #選取物種人類
msigdbr_list = split(x = m_df$gene_symbol, f = m_df$gs_name)
expr=as.matrix(expr)
kegg <- gsva(expr, msigdbr_list, kcdf="Gaussian",method = "gsva",parallel.sz=10) #gsva
pheatmap(kegg, show_rownames=1, show_colnames=0, annotation_col=meta,fontsize_row=5, filename='gsva_heatmap.png', width=15, height=12)#繪制熱圖
es <- data.frame(t(kegg),stringsAsFactors=F) #添加到單細(xì)胞矩陣中,可視化相關(guān)通路的在umap上聚集情況掠归,可理解為一個(gè)通路即一個(gè)基因
scRNA <- AddMetaData(pbmc, es)
p1 <- FeaturePlot(scRNA, features = "KEGG_PRIMARY_BILE_ACID_BIOSYNTHESIS", reduction = 'umap')
p2 <- FeaturePlot(scRNA, features = "KEGG_ETHER_LIPID_METABOLISM", reduction = 'umap')
p3 <- FeaturePlot(scRNA, features = "KEGG_RIBOSOME", reduction = 'umap')
p4 <- FeaturePlot(scRNA, features = "KEGG_ASTHMA", reduction = 'umap')
plotc = (p1|p2)/(p3|p4)
ggsave('gsva_featureplot.png', plotc, width = 10, height = 8) #輸出圖片
##每個(gè)細(xì)胞類別與功能相關(guān)熱圖
meta <- meta %>%arrange(meta$seurat_clusters)
data <- kegg[,rownames(meta)]
group <- factor(meta[,"seurat_clusters"],ordered = F)
data1 <-NULL
for(i in 0:(length(unique(group))-1)){
ind <-which(group==i)
dat <- apply(data[,ind], 1, mean)
data1 <-cbind(data1,dat)
}
colnames(data1) <-c("C0","C1","C2","C3","C4","C5","C6","C7","C8")
result<- t(scale(t(data1)))
result[result>2]=2
result[result<-2]=-2
library(pheatmap)
p <- pheatmap(result[1:20,],
cluster_rows = F,
cluster_cols = F,
show_rownames = T,
show_colnames = T,
color =colorRampPalette(c("blue", "white","red"))(100),
cellwidth = 10, cellheight = 15,
fontsize = 10)
pdf(("gsva_celltype.pdf"),width = 7,height = 7)
print(p)
dev.off()
假設(shè)為兩組對(duì)照實(shí)驗(yàn)缅叠,我們希望看到一些通路(功能是否顯著),則可使用下面方法
DD1 <- subset(pbmc,idents = c("3","6")) # 對(duì)照組放在前面
expr <- as.data.frame(DD1@assays$RNA@data)
meta <- DD1@meta.data[,c("seurat_clusters")]
m_df = msigdbr(species = "Homo sapiens", category = "C2")
msigdbr_list = split(x = m_df$gene_symbol, f = m_df$gs_name)
expr=as.matrix(expr)
kegg <- gsva(expr, msigdbr_list, kcdf="Gaussian",method = "gsva",parallel.sz=10)
library(limma) ##limma做差異分析
exprSet <- kegg
group <- factor(meta,levels = c("3","6"),ordered = F)## 分組變成向量虏冻,并且限定leves的順序肤粱, levels里面,把對(duì)照組放在前面
design <- model.matrix(~group)# 構(gòu)建比較矩陣
colnames(design) <- levels(group)
fit <- lmFit(exprSet,design)#線性模型擬合
fit1 <- eBayes(fit)#貝葉斯檢驗(yàn)
allDiff=topTable(fit1,adjust='fdr',coef=2,number=Inf)
write.table(allDiff,"kegg.txt",col.names=T,row.names=T,sep="\t")
#根據(jù)allDiff找出你自己感興趣的通路,例如
up <- c("REACTOME_SEROTONIN_NEUROTRANSMITTER_RELEASE_CYCLE",'REACTOME_DOPAMINE_NEUROTRANSMITTER_RELEASE_CYCLE','REACTOME_NEUROTRANSMITTER_RELEASE_CYCLE','REACTOME_NEUROTOXICITY_OF_CLOSTRIDIUM_TOXINS','REACTOME_GABA_SYNTHESIS_RELEASE_REUPTAKE_AND_DEGRADATIO')
down <- c('REACTOME_CREB1_PHOSPHORYLATION_THROUGH_NMDA_RECEPTOR_MEDIATED_ACTIVATION_OF_RAS_SIGNALING','REACTOME_RAS_ACTIVATION_UPON_CA2_INFLUX_THROUGH_NMDA_RECEPTOR','REACTOME_LONG_TERM_POTENTIATION','REACTOME_UNBLOCKING_OF_NMDA_RECEPTORS_GLUTAMATE_BINDING_AND_ACTIVATION','REACTOME_CLEC7A_DECTIN_1_INDUCES_NFAT_ACTIVATIO')
TEST <- c(up,down)
p <- allDiff
p$ID <- rownames(p)
q <- p[TEST,]
group1 <- c(rep("6",5),rep("3",5))
df <- data.frame(ID = q$ID, score = q$t,group=group1 )
# 按照score排序
sortdf <- df[order(df$score),]
sortdf$ID <- factor(sortdf$ID, levels = sortdf$ID)#增加通路ID那一列
head(sortdf)
p <- ggplot(sortdf, aes(ID, score,fill=group)) + geom_bar(stat = 'identity') +
coord_flip() +
theme_bw() + #去除背景色
theme(panel.grid =element_blank())+
theme(panel.border = element_rect(size = 0.6))
pdf("gsva_dif.pdf",width=12,height=8)
p
dev.off()
希望大家關(guān)注點(diǎn)贊厨相,謝謝A炻!BJ尽!<酢5サ蟆!R粽;眉睢R锵痢!