一狸棍、 實(shí)驗(yàn)內(nèi)容
利用SAM篩選差異基因
比較篩選差異基因的方法:倍數(shù)法,t檢驗(yàn)味悄,SAM
二草戈、 實(shí)驗(yàn)?zāi)康?/p>
- 掌握篩選差異基因的方法,為了探究倍數(shù)法侍瑟,t檢驗(yàn)唐片,SAM這三種篩選差異基因的方法異同和優(yōu)越性,為我們后期自己做項(xiàng)目尋求最優(yōu)的方法
- 原理
- 倍數(shù)法(FC)
FC=Xi/Xc - T檢驗(yàn)
H0:Gene i表達(dá)沒(méi)有差異
P<閾值,拒絕H0 -
SAM(圖1)
圖1
三涨颜、 實(shí)驗(yàn)數(shù)據(jù)工具及步驟
- 實(shí)驗(yàn)數(shù)據(jù)
從GEO數(shù)據(jù)庫(kù)下載GSE52327_series_matrix.txt,樣本均是乳腺癌费韭,樣本分類按照表達(dá)醛脫氫酶(ALDH)在原代人乳房的細(xì)胞群含量,實(shí)驗(yàn)組是ALDH+咐低,對(duì)照組是ALDH-
label.txt,
GPL570.txt平臺(tái)信息具體三列,probe ID, Gene Symbol, ENTREZ_GENE_ID - 工具
R.3.6.3(limma,samr) - 步驟
- SAM
[1] 下載數(shù)據(jù)并導(dǎo)入
[2] 根據(jù)平臺(tái)文件揽思,探針對(duì)應(yīng)基因,一對(duì)多见擦,一對(duì)空钉汗,多對(duì)多去掉,多對(duì)一取均值
[3] 根據(jù)注釋信息分組并添加標(biāo)簽鲤屡,ALDH+(8個(gè))损痰,ALDH-(8個(gè)),(1,0)
[4] 利用SAM法篩選差異基因,得到分析結(jié)果 - 三種方法比較
[1] 根據(jù)上面探針對(duì)應(yīng)基因后的表達(dá)譜用FC和t-test進(jìn)行差異表達(dá)分析
[2] 設(shè)定不同的閾值,|logFC|<1,2 p<0.1,0.01,0.05,0.001,分別得到差異基因個(gè)數(shù)
[3] 比較三種方法,利用Venn圖展示出來(lái)(|logFC|<2,p<0.1)
[4] 畫火山圖,將差異基因可視化(y:log2FC,x:adjust_p)
[5] 熱圖以t-test為例
#數(shù)據(jù)處理,探針對(duì)基因
setwd("F:\\實(shí)驗(yàn)\\案例分析\\生物信息學(xué)案例分析\\實(shí)驗(yàn)一")
probe_exp<-read.table("GSE52327_series_matrix.txt",header=T,sep="\t",row.names=1) #讀基因表達(dá)文件 54675*16
probeid_geneid<-read.table("GPL570_probe_geneid.txt",header=T,sep="\t") #讀探針文件
label<-read.table("label.txt",as.is=T,header=T,sep="\t")
probe_name<-rownames(probe_exp)
loc<-match(probeid_geneid[,1],probe_name)#按照probe_exp的probe進(jìn)行匹配(長(zhǎng)度與probeid_geneid[,1]相同酒来,返回平臺(tái)文件probe在表達(dá)譜probe_name位置)
probe_exp<-probe_exp[loc,]#確定能匹配上的probe表達(dá)值(把探針對(duì)基因(空)刪除了)
raw_geneid<-as.numeric(as.matrix(probeid_geneid[,3]))
index<-which(!is.na(raw_geneid)) #找出有g(shù)eneid的probeid并建立索引
geneid<-raw_geneid[index] #提取有g(shù)eneid的probe
exp_matrix<-probe_exp[index,] #找到每個(gè)geneid的表達(dá)值
geneidfactor<-factor(geneid)
gene_exp_matrix<-apply(exp_matrix,2,function(x) tapply(x,geneidfactor,mean)) #多個(gè)探針對(duì)應(yīng)1個(gè)基因的情況卢未,取平均值
rownames(gene_exp_matrix)<-levels(geneidfactor) #geneid作為行名
geneid<-rownames(gene_exp_matrix)
gene_exp_matrix2<-cbind(geneid,gene_exp_matrix)
write.table(gene_exp_matrix2,file="geneid_exp.txt",sep="\t",quote=F,row.names=F)
#分組
index1<-which(label=="aldh status: ALDH+")#8
label[index1,]<-1
index2<-which(label=="aldh status: ALDH-")#8
label[index2,]<-0
#數(shù)據(jù)準(zhǔn)備
label<-as.numeric(as.matrix(label))
exp<-gene_exp_matrix
gid<-rownames(gene_exp_matrix)
gid<-as.numeric(gid)
if(!require("BiocManager"))
install.packages("BiocManager",update = F,ask = F)
BiocManager::install("samr")
library("samr")
###SAM
source("detec.slab2.r")
source("samr.compute.delta.table.zhang.r")
rownames(exp)<-gid
colnames(exp)<-label
label[label!=1]<-2
data=list(x=exp,y=label, geneid=rownames(exp),genenames=paste("g",as.character(1:nrow(exp)),sep=""), logged2=TRUE)
samr.obj<-samr(data, resp.type="Two class unpaired", nperms=1000)
source("num_sig_zhang.R") #計(jì)算檢驗(yàn)統(tǒng)計(jì)量
ensemble_zhang<-num_sig_zhang(data,samr.obj,fdr=c(0.001,0.01,0.05,0.10))
#得到不同閾值下:差異基因個(gè)數(shù),檢驗(yàn)統(tǒng)計(jì)量的參數(shù),d(i),基因表達(dá)情況,寫出
write.csv(ensemble_zhang[[1]],file="br_sig_num.csv")
write.csv(ensemble_zhang[[2]],file="br_sig_gene.csv")
write.csv(samr.obj$tt,file="br_d_stat.csv")
write.csv(ensemble_zhang[[3]],file="br_cut_inf.csv")
2.
#FC&T檢驗(yàn)
setwd("F:\\實(shí)驗(yàn)\\案例分析\\生物信息學(xué)案例分析\\實(shí)驗(yàn)一")
exp<-read.table("geneid_exp.txt",header=T,sep="\t",row.names=1)
label<-read.table("label.txt",as.is=T,header=T,sep="\t")
library(limma)
rt<-exp
index1<-which(label=="aldh status: ALDH+")#8
index2<-which(label=="aldh status: ALDH-")#8
hi_exp<-rt[,index1]
lo_exp<-rt[,index2]
rt<-cbind(hi_exp,lo_exp)
#differential
class<-c(rep("ALDH+",8),rep("ALDH-",8))
design<-model.matrix(~factor(class))
colnames(design)<-c("hi","lo")
fit<-lmFit(rt,design)
fit2<-eBayes(fit)
allDiff=topTable(fit2,adjust='fdr',coef=2,number=200000)
write.table(allDiff,file="limmaTab.txt",sep="\t",quote=F)
#通過(guò)不同閾值,得到差異基因列表
foldChange=1
padj=0.05
#UP log2FC>=1
#down log2FC<=-1
foldChange=1
padj=0.05
diff_FC<-allDiff[with(allDiff, (logFC>foldChange|logFC<(-foldChange))),] #1518
diffup<-allDiff[with(allDiff, ((logFC>foldChange))),] #910
diffdown<-allDiff[with(allDiff,((logFC<(-foldChange)))),] #608
diff_p<-allDiff[with(allDiff, (adj.P.Val<padj)),] #147
dim(diff_p)
dim(diffdown)
dim(diffup)
dim(diff_FC)
write.table(diff_p,file="diff_p.txt",sep="\t",quote=F)
write.table(diff_FC,file="diff_FC.txt",sep="\t",quote=F)
write.table(diffup,file="up.txt",sep="\t",quote=F)
write.table(diffdown,file="down.txt",sep="\t",quote=F)
#Venn圖,查看三種方法結(jié)果的交疊情況
gid<-rownames(rt)
str(ensemble_zhang[[2]])
diffgene_sam<-gid[which(ensemble_zhang[[2]]!=0)]
a<-as.numeric(rownames(diff_FC))
b<-as.numeric(rownames(diff_p))
library("VennDiagram")
venn.diagram(list(FoldChange_2=a,Ttest_0.05=b,SAM_0.1=diffgene_sam),resolution = 300,
imagetype = "tiff",
alpha=c(0.5,0.5,0.5),fill=c("red","yellow","blue"),
cat.fontface=4,fontfamily=3,
main="FoldChange&T-test&SAM",
main.cex = 2, main.fontface = 2, main.fontfamily = 3,
filename = "Venn.tif")
#volcano
pdf(file="vol.pdf")
xMax=max(-log10(allDiff$adj.P.Val)) #x軸范圍為P.value最值
yMax=max(abs(allDiff$logFC)) #y軸范圍為foldchange最值
plot(-log10(allDiff$adj.P.Val),allDiff$logFC,xlab="adj.P.Val",ylab="logFC",main="volcano") #把所有的foldchange值和P.value打黑點(diǎn)
diffSub1=subset(allDiff,allDiff$adj.P.Val<0.05&(allDiff$logFC)>1)
diffSub2=subset(allDiff,allDiff$adj.P.Val<0.05&(allDiff$logFC)<(-1))
points(-log10(diffSub1$adj.P.Val),diffSub1$logFC,pch=20,col="red",cex=0.4)
points(-log10(diffSub2$adj.P.Val),diffSub2$logFC,pch=20,col="green",cex=0.4)
#打紅點(diǎn)綠點(diǎn)
abline(h=0,lty=2,lwd=3) #畫中線
dev.off()
###熱圖
setwd("F:\\實(shí)驗(yàn)\\案例分析\\實(shí)驗(yàn)五")
exp<-read.table("geneid_exp.txt",header=T,sep="\t",row.names=1)
label<-read.table("label.txt",header=T,sep="\t")
index1<-which(label==1)#93
index2<-which(label==0)#54
exp1<-exp[,index1]
exp2<-exp[,index2]
exp<-cbind(exp1,exp2)
loc<-match(a,rownames(exp))
exp11<-exp[loc,]#13*4113
annotation_col= data.frame(sample= factor(rep(c("aldh status: ALDH+", "aldh status: ALDH-"), c(8,8)))) #列名
rownames(annotation_col) = colnames(exp)
ann_colors = list(value = c("aldh status: ALDH+"= "#7570B3","aldh status: ALDH-"="#E574AE"))
pheatmap(exp1, annotation_col = annotation_col,
annotation_colors = ann_colors,color=colorRampPalette(colors=c("navy", "white","firebrick3"))(50),
main="heatmap of diffgene by t-test")
實(shí)驗(yàn)結(jié)果:
1.SAM 結(jié)果:得到4個(gè)表格如圖2
圖 2 閾值在在p的0.001,0.01,0.05,0.10,得到的差異基因個(gè)數(shù)
檢驗(yàn)統(tǒng)計(jì)量的參數(shù),分別p的閾值是0.001,0.01,0.05,0.10辽社,Delta是d(i)-dE(i),med false pos是錯(cuò)誤率中位數(shù)伟墙,第三列90%置信區(qū)間錯(cuò)誤率,called是篩選出的差異基因數(shù)滴铅,第五列錯(cuò)誤發(fā)現(xiàn)率中位數(shù)戳葵,第六列90%置信區(qū)間錯(cuò)誤發(fā)現(xiàn)率,最后兩列統(tǒng)計(jì)量閾值(最大最泻撼住)
d(i)值
基因表達(dá)情況(正數(shù)上調(diào),負(fù)數(shù)下調(diào))
方法比較
三種方法比較
A:Venn圖,查看基因交疊情況,|log2FC|>2, T-test p<=0.1/0.05, SAM p<=0.1 B:條形圖 C~F: 熱圖拱烁,三種方法以及三種方法的交集得出的差異基因表達(dá)情況
火山圖(差異基因可視化,紅色上調(diào),綠色下調(diào))
結(jié)果分析:
從上面所有統(tǒng)計(jì)結(jié)果來(lái)看,三種方法中,FC法篩選出的差異基因最多,所以更沒(méi)有參考性,SAM和t檢驗(yàn)結(jié)果有較高的一致性,但在運(yùn)算上,SAM算法要更加優(yōu)秀,篩選出的差異基因也會(huì)更準(zhǔn)確,但是運(yùn)算量大,當(dāng)然,FC值可以用來(lái)判斷基因表達(dá)上調(diào)還是下調(diào).所以,在時(shí)間允許下篩選差異基因首選SAM法