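Differential gene analysis with limma
In the previous two installments we downloaded the data and did the basic processing, handling probes that map to multiple genes
and averaging multiple probes that map to the same gene. Building on that, this post uses the limma package to find differentially expressed genes.
It also covers the volcano plot, the heatmap, PCA and related plots.
Loading the data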
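if(TRUE){
Sys.setlocale('LC_ALL','C')
library(dplyr)
## load the objects saved in the previous installments
if(TRUE){
load("expma.Rdata")
load("probe.Rdata")
}
expma[1:5,1:5]
boxplot(expma)## check the expression distribution
metdata[1:5,1:5]
head(probe)
## check whether Gene Symbol contains duplicates
table(duplicated(probe$`Gene Symbol`))## 12549 FALSE
## merge the annotation into the expression matrix
ID<-rownames(expma)
expma<-log2(as.matrix(expma)+1)## log2 transform; rows stay genes, columns stay samples
eset<-expma[ID %in% probe$ID,] %>% cbind(probe)## assumes the kept rows are in the same order as probe
eset[1:5,1:5]
colnames(eset)
}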
## [1] "GSM188013" "GSM188014" "GSM188016" "GSM188018"
## [5] "GSM188020" "GSM188022" "ID" "Gene Symbol"
## [9] "ENTREZ_GENE_ID"
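The `cbind()` step pairs rows purely by position, so it is only safe when the filtered expression rows and `probe` are in the same order. A minimal sketch of aligning by probe ID with `match()` instead, on made-up toy data (the probe IDs, symbols and values are hypothetical, not taken from this dataset):

```r
# Toy expression matrix: rows are probes, columns are samples (hypothetical values)
expr <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 3,
               dimnames = list(c("p1", "p2", "p3"), c("s1", "s2")))

# Toy annotation in a different order and missing probe p2
annot <- data.frame(ID = c("p3", "p1"),
                    symbol = c("GENE_C", "GENE_A"),
                    stringsAsFactors = FALSE)

# Keep only annotated probes, then look up each kept probe's annotation row by ID
keep <- rownames(expr) %in% annot$ID
expr_kept <- expr[keep, , drop = FALSE]
annot_aligned <- annot[match(rownames(expr_kept), annot$ID), ]

merged <- cbind(as.data.frame(expr_kept), annot_aligned)
print(merged)
```

If the two tables are already in the same order, this gives the same result as the positional `cbind()`; if they are not, it avoids silently attaching the wrong gene symbol to a probe.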
Averaging multiple probes per gene
test1<-aggregate(x=eset[,1:6],by=list(eset$`Gene Symbol`),FUN=mean,na.rm=TRUE)
test1[1:5,1:5]## consistent with the deduplication result
## Group.1 GSM188013 GSM188014 GSM188016 GSM188018
## 1 8.438846 8.368513 7.322442 7.813573
## 2 A1CF 10.979025 10.616926 9.940773 10.413311
## 3 A2M 6.565276 6.422112 8.142194 5.652593
## 4 A4GALT 7.728628 7.818966 8.679885 7.048563
## 5 A4GNT 10.243388 10.182382 9.391991 8.779887
dim(test1)##
## [1] 12549 7
colnames(test1)
## [1] "Group.1" "GSM188013" "GSM188014" "GSM188016" "GSM188018" "GSM188020"
## [7] "GSM188022"
rownames(test1)<-test1$Group.1
test1<-test1[,-1]
eset_dat<-test1
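To make the averaging step concrete, here is a small sketch of the same `aggregate()` pattern on a toy table. The column names `s1`, `s2`, `symbol` and all values are hypothetical:

```r
# Toy data: three probes, two of which map to the same (hypothetical) gene symbol
toy <- data.frame(s1 = c(2, 4, 10),
                  s2 = c(6, 8, 20),
                  symbol = c("GENE_A", "GENE_A", "GENE_B"),
                  stringsAsFactors = FALSE)

# Average the sample columns within each gene symbol
avg <- aggregate(x = toy[, c("s1", "s2")],
                 by = list(symbol = toy$symbol),
                 FUN = mean, na.rm = TRUE)

# Move the symbols to the row names, as done with test1 above
rownames(avg) <- avg$symbol
avg <- avg[, -1]
print(avg)
```

The two GENE_A probes collapse into one row holding their mean, which is why the number of rows after aggregation equals the number of unique symbols.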
PCA plot
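Principal component analysis (PCA) in R is usually done with prcomp() or princomp().
There are two general ways to compute it: princomp() uses spectral decomposition (an eigendecomposition of the covariance or correlation matrix),
while prcomp() and PCA() [FactoMineR] use singular value decomposition (SVD).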
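A quick way to see that the two routes agree is to run both on the same matrix. The sketch below uses simulated data (not the expression matrix from this post) and relies on one known difference: `princomp()` divides by n when estimating variance, whereas `prcomp()` divides by n - 1.

```r
set.seed(1)
# Simulated data: 50 observations, 4 variables
x <- matrix(rnorm(200), nrow = 50, ncol = 4)

pc_svd <- prcomp(x, center = TRUE, scale. = FALSE)  # SVD-based
pc_eig <- princomp(x)                               # eigendecomposition-based

# Rescale princomp's standard deviations from divisor n to divisor n - 1
n <- nrow(x)
sd_rescaled <- unname(pc_eig$sdev) * sqrt(n / (n - 1))

same_sdev <- isTRUE(all.equal(pc_svd$sdev, sd_rescaled))
print(same_sdev)
```

Note that `princomp()` needs more observations than variables, so for a 6-sample expression matrix with thousands of genes `prcomp()` is the practical choice.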
data<-eset_dat
data[1:5,1:5]## expression matrix
## GSM188013 GSM188014 GSM188016 GSM188018 GSM188020
## 8.438846 8.368513 7.322442 7.813573 7.244615
## A1CF 10.979025 10.616926 9.940773 10.413311 9.743305
## A2M 6.565276 6.422112 8.142194 5.652593 5.550033
## A4GALT 7.728628 7.818966 8.679885 7.048563 5.929258
## A4GNT 10.243388 10.182382 9.391991 8.779887 9.431585
metdata[1:5,1:5]
## title
## GSM188013 DMSO treated MCF7 breast cancer cells [HG-U133A] Exp 1
## GSM188014 Dioxin treated MCF7 breast cancer cells [HG-U133A] Exp 1
## GSM188016 DMSO treated MCF7 breast cancer cells [HG-U133A] Exp 2
## GSM188018 Dioxin treated MCF7 breast cancer cells [HG-U133A] Exp 2
## GSM188020 DMSO treated MCF7 breast cancer cells [HG-U133A] Exp 3
## geo_accession status submission_date
## GSM188013 GSM188013 Public on May 12 2007 May 08 2007
## GSM188014 GSM188014 Public on May 12 2007 May 08 2007
## GSM188016 GSM188016 Public on May 12 2007 May 08 2007
## GSM188018 GSM188018 Public on May 12 2007 May 08 2007
## GSM188020 GSM188020 Public on May 12 2007 May 08 2007
## last_update_date
## GSM188013 May 11 2007
## GSM188014 May 11 2007
## GSM188016 May 11 2007
## GSM188018 May 11 2007
## GSM188020 May 11 2007
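## build group_list
## caution: the metdata titles above list DMSO at positions 1, 3, 5 and Dioxin at 2, 4,
## so these labels look swapped relative to the metadata; rep(c("Control","Treat"),3) would match it.
## The labels are kept as run so the outputs below still correspond, but the logFC signs
## should then be read as DMSO minus Dioxin.
group_list<-rep(c("Treat","Control"),3)
colnames(data)<-group_list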
library(factoextra)
## Warning: package 'factoextra' was built under R version 3.5.3
## Loading required package: ggplot2
## Welcome! Related Books: `Practical Guide To Cluster Analysis in R` at https://goo.gl/13EFCZ
## compute the PCA
data<-t(data)## transpose so that rows are samples and columns are genes
data<-as.data.frame(data)## note: convert to a data frame
res.pca <- prcomp(data, scale. = TRUE)
## show how much of the variance each principal component explains
fviz_eig(res.pca)
## visualize the result
fviz_pca_ind(res.pca,
col.ind = group_list, # color by group
palette = c("#00AFBB", "#FC4E07"),
addEllipses = TRUE, # Concentration ellipses
ellipse.type = "confidence",
legend.title = "Group",## legend title
repel = TRUE
)
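The scree plot drawn by `fviz_eig()` is based on the share of variance carried by each component, which can be computed directly from the `sdev` element of a `prcomp()` result. A minimal sketch on a simulated 6 x 20 matrix (hypothetical values, standing in for samples x genes):

```r
set.seed(42)
# Simulated matrix: 6 samples (rows) x 20 genes (columns), hypothetical values
sim <- matrix(rnorm(6 * 20), nrow = 6)

pca <- prcomp(sim, scale. = TRUE)

# Variance of each component is sdev^2; divide by the total to get proportions
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(100 * var_explained, 1))
```

With only six samples there are at most six components, and after centering the last one carries essentially no variance.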
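Hierarchical clustering: the result resembles the PCA and is not very good
The clustering plot could be polished much further, but only a light adjustment is made here
Computing the distances also requires the transposed matrix; data was already transposed in the PCA step, so the transpose is not repeated
dd <- dist(data, method = "euclidean")## data has already been transposed (rows are samples)
hc <- hclust(dd, method = "ward.D2")
plot(hc)
## polish the plot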
# Convert hclust into a dendrogram and plot
hcd <- as.dendrogram(hc)
# Define nodePar
nodePar <- list(lab.cex = 0.6, pch = c(NA, 19),
cex = 0.7, col = "blue")
# Customized plot; remove labels
plot(hcd, ylab = "Height", nodePar = nodePar, leaflab = "none")
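For comparison, this is what the same `dist()` and `hclust()` calls produce when the groups really are well separated. The sketch uses simulated data with an artificially large group difference, so it illustrates the mechanics only and says nothing about the dataset in this post:

```r
set.seed(7)
# Simulated samples x genes matrix: 3 "Control" rows around 0, 3 "Treat" rows around 10
ctrl  <- matrix(rnorm(3 * 20, mean = 0),  nrow = 3)
treat <- matrix(rnorm(3 * 20, mean = 10), nrow = 3)
sim <- rbind(ctrl, treat)
rownames(sim) <- paste0(rep(c("Control", "Treat"), each = 3), 1:3)

dd <- dist(sim, method = "euclidean")   # distances between rows (samples)
hc <- hclust(dd, method = "ward.D2")

# Cut the tree into two clusters and see which samples end up together
clusters <- cutree(hc, k = 2)
print(clusters)
```

`cutree()` gives a quick numeric check of whether the tree recovers the expected grouping, which is easier to compare against `group_list` than reading the dendrogram by eye.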
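Differential gene analysis with limma
For detailed usage of the limma package, see the limma Users Guide
Set up the grouping information; getting the contrast matrix right is the key step
Note that the expression matrix eset_dat used here is the processed one and is not transposed: rows are genes, columns are samples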
library(limma)
library(dplyr)
group_list
## [1] "Treat" "Control" "Treat" "Control" "Treat" "Control"
design <- model.matrix(~0+factor(group_list))
colnames(design)=levels(factor(group_list))
design
## Control Treat
## 1 0 1
## 2 1 0
## 3 0 1
## 4 1 0
## 5 0 1
## 6 1 0
## attr(,"assign")
## [1] 1 1
## attr(,"contrasts")
## attr(,"contrasts")$`factor(group_list)`
## [1] "contr.treatment"
## contrast information
contrast.matrix<-makeContrasts("Treat-Control",
levels = design)
contrast.matrix## inspect the contrast matrix; here we set Treat vs. Control
## Contrasts
## Levels Treat-Control
## Control -1
## Treat 1
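To see what the design and contrast encode, the sketch below does the same thing for a single made-up gene using only base R: an ordinary least-squares fit against the no-intercept design gives the two group means, and the `Treat-Control` contrast turns them into a log fold change. The expression values in `y` are hypothetical, and this covers only the linear-model part, not the empirical Bayes moderation applied by `eBayes()`.

```r
group_list <- rep(c("Treat", "Control"), 3)
design <- model.matrix(~ 0 + factor(group_list))
colnames(design) <- levels(factor(group_list))

# Hypothetical log2 expression values for one gene across the six samples
y <- c(5, 3, 6, 2, 7, 4)

# Least-squares fit: with this design the coefficients are the two group means
coefs <- lm.fit(design, y)$coefficients

# Apply the Treat - Control contrast to the coefficients
contrast <- c(Control = -1, Treat = 1)
logFC <- sum(contrast * coefs[names(contrast)])
print(logFC)
```

Because the data are already on the log2 scale, the difference of group means is the log2 fold change, and its sign depends entirely on which samples were labeled Treat and which Control.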
## fit the model
fit <- lmFit(eset_dat,design)
fit2 <- contrasts.fit(fit, contrast.matrix)
fit2 <- eBayes(fit2)
DEG<-topTable(fit2, coef=1, number=Inf) %>% na.omit() ## coef: which contrast; number: how many genes to return
head(DEG)
## logFC AveExpr t P.Value adj.P.Val B
## ALDH3A1 -3.227263 10.302323 -10.710306 4.482850e-05 0.3134585 -4.048355
## CYP1B1 -3.033684 13.287607 -10.505888 4.995753e-05 0.3134585 -4.049713
## CYP1A1 -9.003353 11.481268 -8.371476 1.762905e-04 0.7374232 -4.069681
## HHLA2 -1.550587 6.595658 -7.443431 3.337672e-04 0.9308066 -4.083411
## SLC7A5 -2.470333 13.628775 -7.298868 3.708688e-04 0.9308066 -4.085966
## TIPARP -1.581274 12.764218 -7.024252 4.552834e-04 0.9522252 -4.091192
dim(DEG)
## [1] 12549 6
save(DEG,file = "DEG_all.Rdata")
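The `adj.P.Val` column is the multiple-testing-corrected p-value; `topTable()` uses the Benjamini-Hochberg method by default. A minimal sketch of that adjustment on five made-up p-values, once with `p.adjust()` and once by hand:

```r
# Hypothetical raw p-values for five genes
p_raw <- c(0.01, 0.02, 0.03, 0.04, 0.05)

# Benjamini-Hochberg adjustment via the built-in function
p_bh <- p.adjust(p_raw, method = "BH")

# The same thing by hand: p * n / rank, then a running minimum from the largest p down
n <- length(p_raw)
ord <- order(p_raw, decreasing = TRUE)
by_hand <- numeric(n)
by_hand[ord] <- pmin(1, cummin(p_raw[ord] * n / (n:1)))

print(p_bh)
```

Since the adjustment scales with the number of tests, 12549 genes and only three samples per group make it hard for any gene to survive: in the table above even the top hits have adj.P.Val above 0.3, which is worth keeping in mind when the later plots filter on the raw P.Value.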
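Drawing a volcano plot
A volcano plot is simply a visualization that gives an overall picture of the differential analysis
Once the differential genes have been computed, the volcano plot can be drawn directly
Its x-axis is logFC and its y-axis is -log10(p-value), so in principle plot() alone is enough to draw it
The simplest, most basic approach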
colnames(DEG)
## [1] "logFC" "AveExpr" "t" "P.Value" "adj.P.Val" "B"
plot(DEG$logFC,-log10(DEG$P.Value))
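The bare `plot()` can be taken one step further in base R by coloring points according to p-value and fold-change thresholds. The sketch below uses a toy data frame with the same `logFC` and `P.Value` column names as `DEG`; the values, the `status` column and the cutoffs are illustrative assumptions:

```r
# Toy stand-in for DEG with the two columns a volcano plot needs (hypothetical values)
toy_deg <- data.frame(logFC   = c(-3.2, 0.2, 1.5, 2.4, -0.4),
                      P.Value = c(1e-5, 0.6, 0.2, 0.001, 0.03),
                      row.names = paste0("gene", 1:5))

p_cut  <- 0.05  # p-value threshold
fc_cut <- 1     # absolute log2 fold-change threshold

# Label each gene as Up, Down or NS (not significant)
toy_deg$status <- "NS"
toy_deg$status[toy_deg$P.Value < p_cut & toy_deg$logFC >  fc_cut] <- "Up"
toy_deg$status[toy_deg$P.Value < p_cut & toy_deg$logFC < -fc_cut] <- "Down"

cols <- c(Down = "blue", NS = "grey", Up = "red")
plot(toy_deg$logFC, -log10(toy_deg$P.Value),
     col = cols[toy_deg$status], pch = 19,
     xlab = "log2 fold change", ylab = "-log10(P.Value)")
abline(v = c(-fc_cut, fc_cut), h = -log10(p_cut), lty = 2)
```

Replacing `toy_deg` with `DEG` should work unchanged, since only the `logFC` and `P.Value` columns are used.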
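A fancier approach
Packages developed by others can provide special features or a nicer-looking result
library(EnhancedVolcano)
EnhancedVolcano(DEG,
lab = rownames(DEG),
x = "logFC",
y = "P.Value",
selectLab = rownames(DEG)[1:5],
xlab = bquote(~Log[2]~ "fold change"),
ylab = bquote(~-Log[10]~italic(P)),
pCutoff = 0.05,## p-value threshold
FCcutoff = 1,## log2 fold-change threshold
xlim = c(-5,5),## note: CYP1A1 (logFC about -9) falls outside this range
transcriptPointSize = 1.8,## called pointSize in newer EnhancedVolcano releases
transcriptLabSize = 5.0,## called labSize in newer releases
colAlpha = 1,
legend=c("NS","Log2 FC"," p-value",
" p-value & Log2 FC"),## called legendLabels in newer releases
legendPosition = "bottom",
legendLabSize = 10,
legendIconSize = 3.0)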
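Drawing a heatmap
Heatmaps are used all the time; once the differential genes are available, one can be drawn directly
The pheatmap package is among the simpler and handier options
Inside a pipe, ordinary extraction with $ or [ needs the special placeholder .
## first extract the data to plot
head(DEG)
## top 50 genes by fold change
up_50<-DEG %>% as_tibble() %>%
mutate(genename=rownames(DEG)) %>%
dplyr::arrange(desc(logFC)) %>%
.$genename %>% .[1:50] ## extraction inside the pipe
## bottom 50 genes by fold change
down_50<-DEG %>% as_tibble() %>%
mutate(genename=rownames(DEG)) %>%
dplyr::arrange(logFC) %>%
.$genename %>% .[1:50] ## extraction inside the pipe
index<-c(up_50,down_50)
## start plotting: the simplest version
library(pheatmap)
pheatmap(eset_dat[index,],show_colnames =FALSE,show_rownames = FALSE)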
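The same top/bottom selection can also be written without a pipe, which avoids the `.` placeholder altogether. A small base-R sketch on a toy table (gene names, values and `n_top` are hypothetical; the post uses 50):

```r
# Toy stand-in for DEG: gene names as row names, one logFC column (hypothetical values)
toy_deg <- data.frame(logFC = c(0.5, -2.0, 3.1, -0.1, 1.2, -4.3),
                      row.names = paste0("gene", 1:6))

n_top <- 2  # the post uses 50

# Gene names ordered from largest to smallest logFC
ranked <- rownames(toy_deg)[order(toy_deg$logFC, decreasing = TRUE)]
up_n   <- head(ranked, n_top)        # largest logFC
down_n <- rev(tail(ranked, n_top))   # smallest logFC, most negative first
index  <- c(up_n, down_n)
print(index)
```

Either style yields a character vector of gene names that can be used to subset the rows of the expression matrix.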
Adjusting a few details
index_matrix<-t(scale(t(eset_dat[index,])))## row-wise z-score scaling
index_matrix[index_matrix>1]=1
index_matrix[index_matrix < -1]=-1## the space matters: "<-1" would be parsed as an assignment
head(index_matrix)
## add annotation
anno=data.frame(group=group_list)
rownames(anno)=colnames(index_matrix)
anno## data frame holding the annotation
pheatmap(index_matrix,
show_colnames =F,
show_rownames = F,
cluster_cols = T,
annotation_col=anno)
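The scaling and clipping used for the heatmap can be checked on a tiny matrix. The sketch below uses made-up values and also shows why the comparison must be written `< -1` with a space:

```r
# Toy genes x samples matrix (hypothetical values)
m <- matrix(c(1, 2, 3, 10,
              5, 5, 6, 8),
            nrow = 2, byrow = TRUE,
            dimnames = list(c("geneA", "geneB"), paste0("s", 1:4)))

# scale() works on columns, so transpose, scale, and transpose back for row-wise z-scores
z <- t(scale(t(m)))

# Clip to [-1, 1]; note the space in "< -1", since "<-1" would be parsed as an assignment
z_clipped <- z
z_clipped[z_clipped > 1]  <- 1
z_clipped[z_clipped < -1] <- -1
print(round(z_clipped, 2))
```

Clipping keeps a few extreme values from dominating the color scale, at the cost of hiding how extreme they are; a threshold of 1 is fairly aggressive, and 2 is another common choice.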
That's all for this installment. I'm 白介素2; see you next time.
Related reading:
GEO data mining in R 01: downloading the data and extracting the expression matrix
GEO data mining in R 02: handling multiple probes that map to one gene in GEO data