m6A相關(guān)的lncRNA怎么進(jìn)行計(jì)算胜榔?下面我們就來探究下
1.表達(dá)矩陣下載
UCSC Xena是University of California開發(fā)的針對(duì)多個(gè)數(shù)據(jù)庫(kù)的綜合網(wǎng)站寨典,上面有針對(duì)TCGA數(shù)據(jù)庫(kù)整理好的RNA-seq表達(dá)矩陣宇立。
網(wǎng)址:http://xena.ucsc.edu/
常用的數(shù)據(jù)是FPKM,可以轉(zhuǎn)換為TPM苟弛。
注意:
1.FPKM是log2(FPKM+1)
2.基因注釋版本是gencode.v22.annotation
3.表達(dá)矩陣是ENSG格式
2.區(qū)分mRNA和lncRNA
打開GENCODE中g(shù)tf文件的biotype的說明,可以發(fā)現(xiàn)lncRNA包含9種類型。
對(duì)lncRNA的說明:
Generic long non-coding RNA biotype that replaced the following biotypes: 3prime_overlapping_ncRNA, antisense, bidirectional_promoter_lncRNA, lincRNA, macro_lncRNA, non_coding, processed_transcript, sense_intronic and sense_overlapping.
# 讀取并探索gtf文件
gtf <- rtracklayer::import("D:/jianshu/R-TCGA-m6AlncRNA/gencode.v22.annotation.gtf")
gtf <- as.data.frame(gtf)
head(gtf,6)
## seqnames start end width strand source type score phase
## 1 chr1 11869 14409 2541 + HAVANA gene NA NA
## 2 chr1 11869 14409 2541 + HAVANA transcript NA NA
## 3 chr1 11869 12227 359 + HAVANA exon NA NA
## 4 chr1 12613 12721 109 + HAVANA exon NA NA
## 5 chr1 13221 14409 1189 + HAVANA exon NA NA
## 6 chr1 12010 13670 1661 + HAVANA transcript NA NA
## gene_id gene_type gene_status gene_name
## 1 ENSG00000223972.5 transcribed_unprocessed_pseudogene KNOWN DDX11L1
## 2 ENSG00000223972.5 transcribed_unprocessed_pseudogene KNOWN DDX11L1
## 3 ENSG00000223972.5 transcribed_unprocessed_pseudogene KNOWN DDX11L1
## 4 ENSG00000223972.5 transcribed_unprocessed_pseudogene KNOWN DDX11L1
## 5 ENSG00000223972.5 transcribed_unprocessed_pseudogene KNOWN DDX11L1
## 6 ENSG00000223972.5 transcribed_unprocessed_pseudogene KNOWN DDX11L1
## level havana_gene transcript_id
## 1 2 OTTHUMG00000000961.2 <NA>
## 2 2 OTTHUMG00000000961.2 ENST00000456328.2
## 3 2 OTTHUMG00000000961.2 ENST00000456328.2
## 4 2 OTTHUMG00000000961.2 ENST00000456328.2
## 5 2 OTTHUMG00000000961.2 ENST00000456328.2
## 6 2 OTTHUMG00000000961.2 ENST00000450305.2
## transcript_type transcript_status transcript_name tag
## 1 <NA> <NA> <NA> <NA>
## 2 processed_transcript KNOWN DDX11L1-002 basic
## 3 processed_transcript KNOWN DDX11L1-002 basic
## 4 processed_transcript KNOWN DDX11L1-002 basic
## 5 processed_transcript KNOWN DDX11L1-002 basic
## 6 transcribed_unprocessed_pseudogene KNOWN DDX11L1-001 basic
## transcript_support_level havana_transcript exon_number exon_id
## 1 <NA> <NA> <NA> <NA>
## 2 1 OTTHUMT00000362751.1 <NA> <NA>
## 3 1 OTTHUMT00000362751.1 1 ENSE00002234944.1
## 4 1 OTTHUMT00000362751.1 2 ENSE00003582793.1
## 5 1 OTTHUMT00000362751.1 3 ENSE00002312635.1
## 6 NA OTTHUMT00000002844.2 <NA> <NA>
## ont protein_id ccdsid
## 1 <NA> <NA> <NA>
## 2 <NA> <NA> <NA>
## 3 <NA> <NA> <NA>
## 4 <NA> <NA> <NA>
## 5 <NA> <NA> <NA>
## 6 PGO:0000019 <NA> <NA>
需要的列有:
gene_id:與TCGA數(shù)據(jù)ENSG格式是一致的
gene_type:區(qū)分lncRNA
gene_name:即gene symbol
type:區(qū)分gene和其他類型升薯,gene只有60483個(gè)
table(gtf$type)
##
## gene transcript exon CDS start_codon
## 60483 198442 1172082 699443 82228
## stop_codon UTR Selenocysteine
## 74337 276542 114
if(!file.exists("D:/jianshu/R-TCGA-m6AlncRNA/gtf_gene.Rdata")){
library(tidyverse)
gtf_gene <- gtf[gtf$type=="gene",] %>% #提取type為gene的行
select(gene_id,gene_name,gene_type)
gtf_mRNA <- gtf_gene[gtf_gene$gene_type=="protein_coding",] #提取mRNA
lnc = c("3prime_overlapping_ncRNA", "antisense", "bidirectional_promoter_lncRNA", "lincRNA", "macro_lncRNA", "non_coding", "processed_transcript", "sense_intronic" , "sense_overlapping")
gtf_lncRNA <- gtf_gene[gtf_gene$gene_type %in% lnc,]
save(gtf_mRNA,gtf_lncRNA,file = "D:/jianshu/R-TCGA-m6AlncRNA/gtf_gene.Rdata")
}
3. m6A和lncRNA求相關(guān)性
分別提取m6A基因表達(dá)矩陣、lncRNA表達(dá)矩陣
刪除正常樣品
相關(guān)性檢驗(yàn): ** a.檢測(cè)lncRNA sd是否大于0.1击困; ** b.檢驗(yàn)lncRNA 在正常與腫瘤組間是否有差異涎劈; ** b.判斷p.value是否小于0.05; ** c.判斷相關(guān)系數(shù)(cor)是否大于閾值
rm(list = ls())
library(limma)
corFilter=0.4 #設(shè)置相關(guān)系數(shù)閾值
pvalueFilter=0.001 #設(shè)置p值過濾標(biāo)準(zhǔn)
# lncRNA數(shù)據(jù)處理
rt <- data.table::fread("D:/jianshu/R-TCGA-m6AlncRNA/lncRNA.txt",data.table = F)
rt <- as.matrix(rt) #變?yōu)閙atrix
rownames(rt) <- rt[,1]
exp <- rt[,2:ncol(rt)]
dimnames <- list(rownames(exp),colnames(exp))
data <- matrix(as.numeric(as.matrix(exp)),nrow=nrow(exp),dimnames = dimnames)
data <- avereps(data)
data <- data[rowMeans(data)>0.1,] #去掉數(shù)值比較小的lncRNA
library(tidyverse)
lncRNA <- data[,str_sub(colnames(data),14,15)<10]
# m6A數(shù)據(jù)處理
rt1 <- data.table::fread("D:/jianshu/R-TCGA-m6AlncRNA/m6aGeneExp.txt",data.table = F)
rt1 <- as.matrix(rt1) #變?yōu)閙atrix
rownames(rt1) <- rt1[,1]
exp1 <- rt1[,2:ncol(rt1)]
dimnames1 <- list(rownames(exp1),colnames(exp1))
m6A <- matrix(as.numeric(as.matrix(exp1)),nrow=nrow(exp1),dimnames = dimnames1)
m6A <- avereps(m6A)
m6A <- m6A[rowMeans(m6A)>0.1,] #去掉數(shù)值比較小的lncRNA
m6A <- m6A[,str_sub(colnames(m6A),14,15)<10]
# m6A和lncRNA相關(guān)性檢驗(yàn)
sampleType <- c(rep(1,ncol(data[,str_sub(colnames(data),14,15)>10])),
rep(2,ncol(data[,str_sub(colnames(data),14,15)<10])))
outTab=data.frame()
for(i in row.names(lncRNA)){
if(sd(lncRNA[i,])>0.1){
test=wilcox.test(data[i,] ~ sampleType)
if(test$p.value<0.05){
for(j in row.names(m6A)){
x=as.numeric(lncRNA[i,])
y=as.numeric(m6A[j,])
corT=cor.test(x,y)
cor=corT$estimate
pvalue=corT$p.value
if((cor>corFilter) & (pvalue<pvalueFilter)){
outTab=rbind(outTab,cbind(m6A=j,lncRNA=i,cor,pvalue,Regulation="postive"))
}
if((cor< -corFilter) & (pvalue<pvalueFilter)){
outTab=rbind(outTab,cbind(m6A=j,lncRNA=i,cor,pvalue,Regulation="negative"))
}
}
}
}
}
write.table(outTab,file = "D:/jianshu/R-TCGA-m6AlncRNA/m6AlncRNA-network.txt",sep = "\t",quote = F,col.names = F)
參考文獻(xiàn):
從TCGA表達(dá)矩陣中分別提取mRNA和lncRNA→生信星球(公眾號(hào))