學(xué)習(xí)了一段時(shí)間的R殴胧,開(kāi)始做些題订咸,來(lái)加深所學(xué)的知識(shí)盖灸。
1.根據(jù)R包org.Hs.eg.db找到下面ensembl 基因ID 對(duì)應(yīng)的基因名(symbol)
1.新建TXT,將題中基因復(fù)制到TXT届囚,并命名有梆。然后導(dǎo)入到R。
a<-data.table::fread("1.txt",header=F)
2.導(dǎo)入R后發(fā)現(xiàn)得進(jìn)行小數(shù)點(diǎn)分割意系,由于lappy和str_split比較復(fù)雜泥耀,因此采用separate函數(shù),
a2 <- separate(a,V1,into = "ensembl_id",sep = "[.]")
3.分割后剩下的就是和參考基因組進(jìn)行對(duì)接
ibrary(org.Hs.eg.db)
g2s=toTable(org.Hs.egSYMBOL)
g2e=toTable(org.Hs.egENSEMBL)
c<-merge(g2e,a2,by="ensembl_id")
d<-merge(c,g2s,by="gene_id")
2.根據(jù)R包hgu133a.db找到下面探針對(duì)應(yīng)的基因名(symbol)
思路和上一題一樣
a<-data.table::fread("1.txt",header=F)
colnames(a)<-c("probe_id")
d<-merge(a,ids,by="probe_id")
3.找到R包CLL內(nèi)置的數(shù)據(jù)集的表達(dá)矩陣?yán)锩娴腡P53基因的表達(dá)量蛔添,并且繪制在 progres.-stable分組的boxplot圖
此題是應(yīng)了解什么是expressionSet,
rm(list=ls())
suppressPackageStartupMessages(library(CLL))
data(sCLLex)
sCLLex
exprSet=exprs(sCLLex)
library(hgu95av2.db)
pd=pData(sCLLex)#查看對(duì)像里的分組信息
ids=toTable(hgu95av2SYMBOL)
head(ids)
boxplot(exprSet["1939_at",]~pd$Disease)
4.找到BRCA1基因在TCGA數(shù)據(jù)庫(kù)的乳腺癌數(shù)據(jù)集(Breast Invasive Carcinoma (TCGA, PanCancer Atlas))的表達(dá)情況
cbioportal地址:http://www.cbioportal.org/
1.按照網(wǎng)址下載所需要的數(shù)據(jù)plot.txt
2.將下載的數(shù)據(jù)導(dǎo)入到R痰催,進(jìn)行整理
a <- data.table::fread("plot.txt")
colnames(a) = c('id','subtype','expression','mutant')
suppressPackageStartupMessages(library(ggstatsplot))
ggbetweenstats(data = BRCA1, x="subtype", y="expression")
image.png
5.找到TP53基因在TCGA數(shù)據(jù)庫(kù)的乳腺癌數(shù)據(jù)集的表達(dá)量分組看其是否影響生存提示使用:http://www.oncolnc.org/
按照B站jimmy大神的操作找到源數(shù)據(jù)文件兜辞,導(dǎo)入進(jìn)R后作圖
a<-data.table::fread("BRCA_7157_50_50.csv")
class(a)
View(a)
ggbetweenstats(data = a, x='Status', y='Expression')
image.png
6.下載數(shù)據(jù)集GSE17215的表達(dá)矩陣并且提取下面的基因畫(huà)熱圖
ACTR3B ANLN BAG1 BCL2 BIRC5 BLVRA CCNB1 CCNE1 CDC20 CDC6 CDCA1 CDH3 CENPF CEP55 CXXC5 EGFR ERBB2 ESR1 EXO1 FGFR4 FOXA1 FOXC1 GPR160 GRB7 KIF2C KNTC2 KRT14 KRT17 KRT5 MAPT MDM2 MELK MIA MKI67 MLPH MMP11 MYBL2 MYC NAT1 ORC6L PGR PHGDH PTTG1 RRM2 SFRP1 SLC39A6 TMEM45B TYMS UBE2C UBE2T
suppressPackageStartupMessages(library(GEOquery))
gse17215 = getGEO("GSE17215", destdir=".", AnnotGPL = F, getGPL = F)
class(gse17215)
length(gse17215)
class(gse17215[[1]])
a=gse17215[[1]]
dat<-exprs(a)
class(dat)
dim(dat)
head(dat)
library(hgu133a.db)
ids=toTable(hgu133aSYMBOL)
colnames(data)="probe_id"
ids$median=apply(dat,1,median)
dat=dat[ids$probe_id,]
ids$median=apply(dat,1,median)
ids=ids[order(ids$symbol,ids$median,decreasing = T),]
ids=ids[!duplicated(ids$symbol),]
dat=dat[ids$probe_id,]
rownames(dat)=ids$symbol
ng='ACTR3B ANLN BAG1 BCL2 BIRC5 BLVRA CCNB1 CCNE1 CDC20
CDC6 CDCA1 CDH3 CENPF CEP55 CXXC5 EGFR ERBB2 ESR1 EXO1
FGFR4 FOXA1 FOXC1 GPR160 GRB7 KIF2C KNTC2 KRT14 KRT17 KRT5
MAPT MDM2 MELK MIA MKI67 MLPH MMP11 MYBL2 MYC NAT1
ORC6L PGR PHGDH PTTG1 RRM2 SFRP1 SLC39A6 TMEM45B TYMS UBE2C UBE2T'
ng=strsplit(ng,' ')[[1]]
table(ng %in% rownames(dat))
ng=ng[ng %in% rownames(dat)]
dat=dat[ng,]
dat=log2(dat)
pheatmap::pheatmap(dat,scale = 'row')
image.png
7.下載數(shù)據(jù)集GSE24673的表達(dá)矩陣計(jì)算樣本的相關(guān)性并且繪制熱圖,需要標(biāo)記上樣本分組信息
gse24673 = getGEO("GSE24673", destdir=".", AnnotGPL = F, getGPL = F)
class(gse24673)
a=gse24673[[1]]
dat=exprs(a)
class(a)
dim(dat)
pd=pData(a)
group_list=c('rbc','rbc','rbc',
'rbn','rbn','rbn',
'rbc','rbc','rbc',
'normal','normal')
# 相關(guān)性分析并作圖
M=cor(dat)
View(M)
pheatmap::pheatmap(M)
image.png
tmp=data.frame(g=group_list)
class(tmp)
View(tmp)
rownames(tmp)=colnames(M)
pheatmap::pheatmap(M,annotation_col = tmp)
image.png
8.找到 GPL6244 platform of Affymetrix Human Gene 1.0 ST Array 對(duì)應(yīng)的R的bioconductor注釋包陨囊,并且安裝它弦疮!
hugene10sttranscriptcluster
BiocManager::install("hugene10sttranscriptcluster.db")
9.下載數(shù)據(jù)集GSE42872的表達(dá)矩陣夹攒,并且分別挑選出 所有樣本的(平均表達(dá)量/sd/mad/)最大的探針蜘醋,并且找到它們對(duì)應(yīng)的基因
表達(dá)矩陣的下載已經(jīng)看過(guò)很多次了,但是還是忘
rm(list = ls())
options(stringsAsFactors = F)
library(GEOquery)
gset <- getGEO('GSE42872', destdir=".",
AnnotGPL = F,
getGPL = F)
a=gset[[1]]
dat=exprs(a)
pd=pData(a)
# 分別挑選出 所有樣本的(平均表達(dá)量/sd/mad/)最大的探針
sort(apply(dat,1,mean),decreasing = T)[1]
sort(apply(dat,1,sd),decreasing = T)[1]
sort(apply(dat,1,mad),decreasing = T)[1]
# 加載bioconduct注釋包
library(hugene10sttranscriptcluster.db)
ls("package:hugene10sttranscriptcluster.db")
ggg <-toTable(hugene10sttranscriptclusterSYMBOL)
# 找一下咏尝;尷尬的是7978905找不到...
ggg2 <- ggg[ggg$probe_id%in%7978905,]
ggg2 <- ggg[ggg$probe_id%in%8133876,]
后來(lái)發(fā)現(xiàn)基因注釋包和表達(dá)矩陣就沒(méi)有相同的探針
10.下載數(shù)據(jù)集GSE42872的表達(dá)矩陣压语,并且根據(jù)分組使用limma做差異分析,得到差異結(jié)果矩陣编检。
rm(list = ls())
options(stringsAsFactors = F)
library(GEOquery)
gset <- getGEO('GSE42872', destdir=".",
AnnotGPL = F,
getGPL = F)
class(gset)
a<-gset[[1]]
b<-exprs(a)
pd=pData(a)
library(stringr)
group_list=unlist(lapply(pd$title,function(x){
str_split(x,' ')[[1]][4]
}))
exprSet=b
suppressMessages(library(limma))
design <- model.matrix(~0+factor(group_list))
colnames(design)=levels(factor(group_list))
rownames(design)=colnames(exprSet)
contrast.matrix<-makeContrasts(paste0(unique(group_list),collapse = "-"),levels = design)
contrast.matrix
fit <- lmFit(exprSet,design)
fit2 <- contrasts.fit(fit, contrast.matrix)
fit2 <- eBayes(fit2)
tempOutput = topTable(fit2, coef=1, n=Inf)
nrDEG = na.omit(tempOutput)
head(nrDEG)
image.png