本文寫于觀看生信技能樹公眾號(hào)(vx: biotrainee)的七步走純R代碼通過數(shù)據(jù)挖掘復(fù)現(xiàn)一篇實(shí)驗(yàn)文章(第1到6步)一文后废睦,感覺生信技能樹優(yōu)秀學(xué)徒的工作十分吸引人黎茎,就自己動(dòng)手復(fù)現(xiàn)了一次醒陆。
Step00.問題概述
本文的任務(wù)是全代碼復(fù)現(xiàn)一篇paper望浩,標(biāo)題為 :Co-expression networks revealed potential core lncRNAs in the triple-negative breast cancer. PMID:27380926
ref: 生信技能樹--七步走純R代碼通過數(shù)據(jù)挖掘復(fù)現(xiàn)一篇實(shí)驗(yàn)文章(第1到6步)
文章是在8名乳腺癌的患者開展了轉(zhuǎn)錄組測序并分析后作出的侥祭。復(fù)現(xiàn)測序的流程恐怕不太現(xiàn)實(shí)歌馍,但是我們可以通過TCGA數(shù)據(jù)庫中的腫瘤數(shù)據(jù)復(fù)現(xiàn)文章的數(shù)據(jù)分析流程尊勿。
本文的分析流程包括:
下載數(shù)據(jù)
數(shù)據(jù)清洗
質(zhì)量控制
差異分析
注釋mRNA,lncRNA
富集分析
至于WGCNA分析在本文就不再復(fù)現(xiàn)了靶剑,有興趣的同學(xué)也可以查閱生信技能樹的文章七步走純R代碼通過數(shù)據(jù)挖掘復(fù)現(xiàn)一篇實(shí)驗(yàn)文章(第七步WGCNA)
Step01.數(shù)據(jù)下載
-
TCGA database
TCGA數(shù)據(jù)庫上的數(shù)據(jù)下載可以參考生信技能樹上有關(guān)的文章送你一篇TCGA數(shù)據(jù)挖掘文章。在本文中也簡要地復(fù)述下載流程压怠。
首先眠冈,登入UCSC Xena
選擇TCGA的breast cancer data
下載RNAseq表達(dá)矩陣和臨床信息
P.S. 要注意的是在生信技能樹中使用的是GDC的breast cancer dataset,而本文使用的是TCGA 的菌瘫。兩個(gè)dataset分析出來的數(shù)據(jù)差異頗大蜗顽。
-
Ensembl GTF file -- annotation information
在Ensembl的FTP download頁(http://asia.ensembl.org/info/data/ftp/index.html)中,選擇人的GTF文件:
隨后雨让,下載文件“Homo_sapiens.GRCh38.98.chr.gtf.gz”即可雇盖。
Step02.數(shù)據(jù)清洗
該步驟需要從臨床信息中提取中三陰性乳腺癌樣本的臨床信息與表達(dá)矩陣,并將腫瘤樣本與正常樣本進(jìn)行配對(duì)栖忠。
三陰性乳腺癌(Triple-negative breast cancer, TNBC) : 指的是以下三種受體均不表達(dá)的乳腺癌類型:
- 雌激素受體:estrogen receptor (ER) ;
- 孕激素受體:progesterone receptor(PR) ;
- 人類表皮生長因子受體2: HER2/neu
rm(list = ls())
#selecting triple-negative breast cancer samples from phenotype data
#extracting clinical information
p <- read.table('.../data/BRCA_clinicalMatrix',header = T,
sep = '\t',quote = '')
colnames(p)[grep("receptor_status", colnames(p))]
## [1] "breast_carcinoma_estrogen_receptor_status"
## [2] "breast_carcinoma_progesterone_receptor_status"
## [3] "lab_proc_her2_neu_immunohistochemistry_receptor_status"
## [4] "metastatic_breast_carcinoma_estrogen_receptor_status"
## [5] "metastatic_breast_carcinoma_progesterone_receptor_status"
# examining how many triple-negative receptors samples
table(p$breast_carcinoma_estrogen_receptor_status == 'Negative' &
p$breast_carcinoma_progesterone_receptor_status == 'Negative' &
p$lab_proc_her2_neu_immunohistochemistry_receptor_status == 'Negative')
## FALSE TRUE
## 1117 130
# extracting tnbc samples
tnbc_samples <- p[p$breast_carcinoma_estrogen_receptor_status == 'Negative' &
p$breast_carcinoma_progesterone_receptor_status == 'Negative' &
p$lab_proc_her2_neu_immunohistochemistry_receptor_status == 'Negative', ]
在TCGA的命名規(guī)則中樣本名字的第14,15個(gè)字符是以兩位數(shù)字表示的崔挖,其中01-09表示腫瘤樣本贸街,10-16表示正常對(duì)照樣本,具體對(duì)應(yīng)關(guān)系可查看其幫助網(wǎng)頁:https://gdc.cancer.gov/resources-tcga-users/tcga-code-tables/sample-type-codes
因此狸相,在后續(xù)分析中我們分別將對(duì)應(yīng)位置為01的分到tumor group匾浪,11的分為normal group
#pairing tumor samples with normal samples
library(stringr)
tab1 <- tnbc_samples[1:2] # includes 'sampleID' & "AJCC_Stage_nature2012"
tumor <- tab1[substr(tab1$sampleID,14,15) < 10,]
tumor$TCGAID <- str_sub(tumor$sampleID,1,12)
normal <- tab1[!substr(tab1$sampleID,14,15) <10,]
normal$TCGAID <- str_sub(normal$sampleID,1,12)
dim(tumor)
## [1] 117 3
dim(normal)
## [1] 13 3
tnbc_samples_paired <- merge(tumor,normal,by = 'TCGAID') # return samples only have N-T pairing
#because samples 'TCGA-BH-A18V' have been detected twice,
#so we remove the duplicated one 'TCGA-BH-A18V-06'
tnbc_samples_paired <- tnbc_samples_paired[-6, ]
save(tnbc_samples_paired, file = "data/tnbc_samples_paired.Rdata")
#gene expression matrix
rawdata <- read.csv("data/HiSeqV2", sep = '\t', header = T)
rawdata <- as.data.frame(rawdata)
rawdata[1:3,1:3]
## sample TCGA.AR.A5QQ.01 TCGA.D8.A1JA.01
## 1 ARHGEF10L 9.5074 7.4346
## 2 HIF3A 1.5787 3.6607
## 3 RNF17 0.0000 0.6245
tnbc_samples_paired[1,"sampleID.x"]
## [1] TCGA-A7-A4SE-01
#make sampleid suitable for comparing with the id in rawdata
t_idfordata <- tnbc_samples_paired$sampleID.x
t_idfordata <- gsub('-','.',t_idfordata)
tnbc_samples_paired$t_dataid <- t_idfordata
n_idfordata <- tnbc_samples_paired$sampleID.y
n_idfordata <- gsub('-','.',n_idfordata)
tnbc_samples_paired$n_dataid <- n_idfordata
table(colnames(rawdata) %in% tnbc_samples_paired$t_dataid)
## FALSE TRUE
## 1205 14
tab2 <- rawdata[ ,colnames(rawdata) %in% tnbc_samples_paired$t_dataid]
tab3 <- rawdata[ ,colnames(rawdata) %in% tnbc_samples_paired$n_dataid]
tab2 <- tab2[, str_sub(colnames(tab2),1,12) %in% str_sub(colnames(tab3),1,12)]
expr <- cbind(tab2, tab3)
rownames(expr) <- rawdata[ ,1]
expr <- t(expr)
expr[1:3,1:3]
## ARHGEF10L HIF3A RNF17
## TCGA.E2.A1L7.01 9.8265 1.7767 0.0000
## TCGA.BH.A1FC.01 9.6724 2.2705 0.6677
## TCGA.E2.A1LS.01 9.3743 7.8902 0.0000
save(expr,file = "data/TNBC_pair_expr.Rdata")
Step03.質(zhì)量控制
提取表達(dá)矩陣后,我們需要對(duì)提取到的數(shù)據(jù)進(jìn)行質(zhì)量檢測卷哩,看看分組是否正確等等蛋辈。在這里分別使用PCA和聚類的方法對(duì)表達(dá)矩陣進(jìn)行分析。一般而言将谊,兩者之一都可以作為表達(dá)矩陣質(zhì)量分析的可視化結(jié)果冷溶,在此處為了展示方式方法的多樣性,我們都將其進(jìn)行展示尊浓。
# using pca to exmain the data quality
library(factoextra)
library(FactoMineR)
group <- c(rep('tumor',11), rep('normal',11))
expr.pca <- PCA(expr,graph = F)
fviz_pca_ind(expr.pca,
geom.ind = "point",
col.ind = group,
addEllipses = TRUE,
legend.title = "Groups")
#cluster for exmaining the data quality
plot(hclust(dist(expr)))
兩種分析都將tumor和normal group清晰地分開逞频,說明表達(dá)矩陣質(zhì)量良好。
Step04.差異表達(dá)分析
本次差異分析使用DESeq2
進(jìn)行栋齿,由于DESeq2
要求輸入的表達(dá)矩陣數(shù)據(jù)是未標(biāo)準(zhǔn)化前的值苗胀,而TCGA上表達(dá)矩陣的值是進(jìn)行過log2(norm_count+1)校正的,因此在差異分析之前瓦堵,需要進(jìn)行un-normalized
library(DESeq2)
# un-normalization
dat <- as.data.frame(t(expr))
# un-normalization
dat <- 2^dat - 1
dat <- ceiling(dat)
dat[1:3,1:3]
## TCGA.E2.A1L7.01 TCGA.BH.A1FC.01 TCGA.E2.A1LS.01
## ARHGEF10L 907 815 663
## HIF3A 3 4 237
## RNF17 0 1 0
在DESeq2
分析過程中基协,會(huì)將表達(dá)矩陣存儲(chǔ)在dds
對(duì)象中,以存儲(chǔ)中間變量和進(jìn)行一部分計(jì)算菇用。dds
對(duì)象的構(gòu)建需要包括以下幾方面數(shù)據(jù):
- un-normalized expression matrix(or count matrix)
- colData :存儲(chǔ)樣本信息
- design formula:指明在模型中的變量澜驮,并用于估計(jì)模型的離散值和log2 fold changes
# Transforming to dds object
group_list <- factor(rep(c('tumor','normal'), each = 11))
colData <- data.frame(row.names=colnames(dat),
group_list=group_list)
dds <- DESeqDataSetFromMatrix(countData = dat,
colData = colData,
design = ~group_list,
tidy = F)
dim(dds)
## [1] 20530 22
# filtering very low-expression data
table(rowSums(counts(dds)==0))
# keep rows at least have almost 70% samples being detected
keep <- rowSums(counts(dds)==0)< 16
dds <- dds[keep, ]
counts(dds)[1:10,1:3]
dim(dds) # more than 2,000 genes being romoved
## [1] 18423 22
# Performing differential expression analysis
dds <- DESeq(dds)
# Extracting transformed values
vsd <- vst(dds, blind = F)
# specifying the contrast in model using to estimate the fold change and p-value
contrast <- c("group_list","tumor","normal")
dd1 <- results(dds, contrast=contrast, alpha = 0.05)
plotMA(dd1, ylim=c(-2,2))
MA-plot用于可視化fold change與gene counts之間的關(guān)系,默認(rèn)情況下p < 0.1的值會(huì)被標(biāo)紅惋鸥,而超過y軸范圍的值則以三角形表示
lfcShrink
可對(duì)log fold change進(jìn)行矯正以消除低表達(dá)基因帶來的誤差
# lfcShrink
dd3 <- lfcShrink(dds, coef = "group_list_tumor_vs_normal", res=dd1, type='apeglm')
dd3
plotMA(dd3, ylim=c(-2,2))
summary(dd3, alpha = 0.05)
## out of 18423 with nonzero total read count
## adjusted p-value < 0.05
## LFC > 0 (up) : 4054, 22%
## LFC < 0 (down) : 2837, 15%
## outliers [1] : 0, 0%
## low counts [2] : 0, 0%
## (mean count < 0)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
# considering genes which fold change > 2 or <0.5 and adjusted-p <0.05 as significantly differential expressed
sig <- abs(dd3$log2FoldChange)>1 & dd3$padj<0.05
res_sig <- dd3[sig,]
summary(res_sig)
## out of 4215 with nonzero total read count
## adjusted p-value < 0.05
## LFC > 0 (up) : 2255, 53%
## LFC < 0 (down) : 1960, 47%
## outliers [1] : 0, 0%
## low counts [2] : 0, 0%
save(dd3,res_sig,vsd, file = '.../data/TCGA_TNBC_DE.Rdata')
差異分析結(jié)果可視化
# visualization
library(ggplot2)
library(ggthemes)
res <- as.data.frame(dd3)
res$threshold <- as.factor(ifelse(res$padj < 0.05 & abs(res$log2FoldChange) >=log2(2),ifelse(res$log2FoldChange > log2(2) ,'Up','Down'),'Not'))
plot2 <- ggplot(data=res, aes(x=log2FoldChange, y =-log10(padj), colour=threshold,fill=threshold)) +
scale_color_manual(values=c("blue", "grey","red"))+
geom_point(alpha=0.4, size=1.2) +
theme_bw(base_size = 12, base_family = "Times") +
geom_vline(xintercept=c(-0.5,0.5),lty=4,col="grey",lwd=0.6)+
geom_hline(yintercept = -log10(0.05),lty=4,col="grey",lwd=0.6)+
theme(legend.position="right",
panel.grid=element_blank(),
legend.title = element_blank(),
legend.text= element_text(face="bold", color="black",family = "Times", size=8),
plot.title = element_text(hjust = 0.5),
axis.text.x = element_text(face="bold", color="black", size=12),
axis.text.y = element_text(face="bold", color="black", size=12),
axis.title.x = element_text(face="bold", color="black", size=12),
axis.title.y = element_text(face="bold",color="black", size=12)) +
labs( x="log2 (Fold Change)",y="-log10 (p-value)")
plot2
Step05.注釋
文章中對(duì)數(shù)據(jù)注釋后分為了mRNA和lncRNA杂穷,并對(duì)兩者分別進(jìn)行了分析。接下來我們也將利用Ensembl的GTF進(jìn)行注釋卦绣。
library(rtracklayer)
library(tidyr)
library(dplyr)
library(pheatmap)
require(org.Hs.eg.db)
gtf1 <- import('data/Homo_sapiens.GRCh38.98.chr.gtf')
gtf_df <- as.data.frame(gtf1)
colnames(gtf_df)
# extracting "gene_id" ,"gene_biotype"
gtf <- gtf_df[,c(10,14)]
head(gtf)
save(gtf,file = "data/Homo_sapiens.GRCh38.98.chr.Rdata")
keytypes(org.Hs.eg.db)
res_sig$gene_names <- rownames(res_sig)
# ID transformation
res_id <- clusterProfiler::bitr(res_sig$gene_names,
fromType = 'SYMBOL',
toType = "ENSEMBL",
OrgDb = 'org.Hs.eg.db')
k <- res_id[res_id$ENSEMBL %in% gtf$gene_id, 2] %>%
match(gtf$gene_id)
id_keep <- gtf[k,]
colnames(res_id) <- c("gene_names",'gene_id')
id_keep <- merge(id_keep, res_id, by='gene_id')
## lncRNA polymorphic_pseudogene
## 30 2
## processed_pseudogene transcribed_processed_pseudogene
## 1 6
## protein_coding snoRNA
## 3650 3
## transcribed_unitary_pseudogene transcribed_unprocessed_pseudogene
## 6 15
res_ord <- as.data.frame(res_sig[order(res_sig$padj),])
# extracting mRNA and lncRNA results respectively
res_mrna <- id_keep[id_keep$gene_biotype=='protein_coding',] %>%
merge(as.data.frame(res_ord), by = "gene_names")
res_lncrna <- id_keep[id_keep$gene_biotype=='lncRNA',] %>%
merge(as.data.frame(res_ord), by = "gene_names")
save(res_ord,res_mrna,res_lncrna, file = '.../data/TCGA_annotation_results.Rdata')
Step06.富集分析
富集分析及其可視化采用clusterProfiler
進(jìn)行耐量,由于kegg識(shí)別的ID為"ENTREZID",因此在分析之前也進(jìn)行了一次轉(zhuǎn)換滤港。同時(shí)廊蜒,在轉(zhuǎn)換的過程中出現(xiàn)了"ENSEMBL"--"ENTREZID" multi-mapping的情況,因此我們移除了冗余的id蜗搔。
library(clusterProfiler)
library(org.Hs.eg.db)
library(ggplot2)
library(RColorBrewer)
library(gridExtra)
library(enrichplot)
deid <- bitr(res_mrna$gene_id,
fromType = "ENSEMBL",
toType = "ENTREZID",
OrgDb = 'org.Hs.eg.db')
## 'select()' returned 1:many mapping between keys and columns
deid <- deid[!duplicated(deid$ENSEMBL),]
# cc,MF not showed
ego_BP <- enrichGO(gene = deid$ENTREZID,
OrgDb = org.Hs.eg.db,
keyType = "ENTREZID",
ont = "BP",
pvalueCutoff = 0.05,
qvalueCutoff = 0.05,
readable = TRUE)
dotplot(ego_BP, showCategory = 20,font.size = 8)
ego_KEGG <- enrichKEGG(gene = deid$ENTREZID, organism = "hsa",
keyType = 'kegg',
pvalueCutoff = 0.05,
pAdjustMethod = "BH",
minGSSize = 10, maxGSSize = 500,
qvalueCutoff = 0.05,
use_internal_data = FALSE)
dotplot(ego_KEGG, showCategory = 20,font.size = 8)
本次代碼復(fù)現(xiàn)也到此暫告一段落劲藐,可能是由于數(shù)據(jù)集或是分析代碼的改動(dòng)八堡,我們注釋到的顯著差異表達(dá)的lncRNA只有30個(gè)樟凄,遠(yuǎn)遠(yuǎn)小于文章報(bào)道的1,211個(gè)。原則上兄渺,我們應(yīng)先對(duì)差異分析結(jié)果注釋缝龄,再行設(shè)定cutoff以找出顯著的差異表達(dá)基因。但是在先行注釋的情況下,仍然只能在本數(shù)據(jù)集中找到143個(gè)lncRNA叔壤,讓我不禁懷疑不同數(shù)據(jù)集的差異性真的有這么大瞎饲?亦或是由樣本的差異性所導(dǎo)致的?
#dd3是lfcshrink的差異分析結(jié)果
dd3$gene_names <- rownames(dd3)
res_id2 <- clusterProfiler::bitr(dd3$gene_names,
fromType = 'SYMBOL',
toType = "ENSEMBL",
OrgDb = 'org.Hs.eg.db')
dim(res_id2)
## [1] 17830 2
k2 <- res_id2[res_id2$ENSEMBL %in% gtf$gene_id, 2] %>%
match(gtf$gene_id)
id_keep2 <- gtf[k2,]
colnames(res_id2) <- c("gene_names",'gene_id')
id_keep2 <- merge(id_keep2, res_id2, by='gene_id')
table(id_keep2$gene_biotype)
lncRNA misc_RNA
143 1
polymorphic_pseudogene processed_pseudogene
12 12
protein_coding ribozyme
15534 1
scaRNA snoRNA
5 26
TEC TR_C_gene
3 1
transcribed_processed_pseudogene transcribed_unitary_pseudogene
33 14
transcribed_unprocessed_pseudogene unitary_pseudogene
79 1
unprocessed_pseudogene
3
不論分析結(jié)果如何炼绘,本次分析流程也是十分值得學(xué)習(xí)的嗅战。至于文中的疑問在解決后會(huì)回來填坑的!最后俺亮,再次感謝生信技能樹(vx: biotrainee)的分享驮捍,大家快去關(guān)注吧!
補(bǔ)坑
在咨詢生信技能樹的jimmy老師后脚曾,對(duì)本文中的一些問題也得到了解答东且。關(guān)于為何最終注釋的lncRNA較少主要是因?yàn)檫x取的表達(dá)矩陣是RSEM normalized count matrix,該數(shù)據(jù)集含有的non-coding genes 的數(shù)量本來就較少本讥,故能夠注釋到的lncRNA也會(huì)較少珊泳。但在TCGA Breast Cancer (BRCA)的數(shù)據(jù)集中我暫時(shí)還沒發(fā)現(xiàn)到轉(zhuǎn)錄組的表達(dá)矩陣,該數(shù)據(jù)集的RNA-seq數(shù)據(jù)基本上是使用polyA+ IlluminaHiSeq拷沸,意味著測序的基本上都是mRNA色查。miRNA的data倒是有,但整個(gè)轉(zhuǎn)錄組的data還沒找到撞芍,如果有找到的朋友也可以告知我综慎。
補(bǔ)充于2019/10/13
完。