1. What input data is required
The following passage is quoted from the CIBERSORT introduction:
Importantly, all expression data should be non-negative, devoid of missing values, and represented in non-log linear space.
For Affymetrix microarrays, a custom chip definition file (CDF) is recommended (see Subheading 3.2.2) and should be normalized with MAS5 or RMA.
Illumina Beadchip and single color Agilent arrays should be processed as described in the limma package.
Standard RNA-Seq expression quantification metrics, such as fragments per kilobase per million (FPKM) and transcripts per kilobase million (TPM), are suitable for use with CIBERSORT. (from "Profiling Tumor Infiltrating Immune Cells with CIBERSORT")
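This spells out the input requirements very clearly:
1. No negative values and no missing values.
2. Do not log-transform the data.
3. For microarray data, normalize Affymetrix chips with RMA (or MAS5); process Illumina BeadChip and single-color Agilent arrays with limma.
4. For RNA-Seq expression values, FPKM and TPM are both suitable.
The microarray requirements may look intimidating, but routine GEO expression matrices are produced exactly this way, so you can usually download and use them directly. Note that some expression matrices are already log-transformed when downloaded and need to be converted back to linear scale. Others have been standardized or contain negative values, in which case you need to start from the raw data. I wrote about both cases earlier: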
http://www.reibang.com/p/d7035ba8347b
http://www.reibang.com/p/e3d734b2c404
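The checks described above can be sketched in a few lines of base R. This is only a minimal sketch on a made-up toy matrix: the object names are hypothetical, and the "maximum below 50 means log scale" test is a rule of thumb I am assuming here, not something the CIBERSORT paper prescribes.

```r
# Toy matrix standing in for a downloaded expression matrix (genes x samples).
# The values are made up for illustration only.
expr <- matrix(c(5.2, 8.1, 0.0, 10.3,
                 6.0, 7.7, 1.5, 11.0),
               nrow = 4,
               dimnames = list(paste0("gene", 1:4), c("s1", "s2")))

# Requirement 1: no missing values and no negative values
has_na  <- anyNA(expr)
has_neg <- any(expr < 0, na.rm = TRUE)

# Requirement 2: linear (non-log) scale. Rule of thumb (an assumption):
# if the maximum is small, e.g. below 50, the matrix was probably log2-transformed.
looks_logged <- max(expr, na.rm = TRUE) < 50

# Undo a log2(x + 1) transform; use 2^expr instead if no pseudocount was added
if (looks_logged) {
  expr_linear <- 2^expr - 1
} else {
  expr_linear <- expr
}
```

Whether to subtract the pseudocount depends on how the matrix was produced, so it is worth checking the dataset's processing notes before converting back.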
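3. A worked example
3.1. Download the TCGA RNA-seq expression data
There are several channels for downloading count or FPKM data. Converting FPKM to TPM is actually less painful, but my earlier tutorials only downloaded counts for the downstream differential analysis, and I don't want to go back and download FPKM. So I simply convert the counts to TPM.
For how to obtain TCGA-CHOL_gdc.Rdata, see: TCGA-1. GDC data download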
rm(list = ls())
library(tinyarray)
library(tidyverse)
load("TCGA-CHOL_gdc.Rdata")
exp[1:4,1:4]
## TCGA-W5-AA36-01A-11R-A41I-07 TCGA-W5-AA2H-01A-31R-A41I-07
## ENSG00000000003.13 2504 226
## ENSG00000000005.5 0 5
## ENSG00000000419.11 1272 1146
## ENSG00000000457.12 504 602
## TCGA-ZU-A8S4-11A-11R-A41I-07 TCGA-WD-A7RX-01A-12R-A41I-07
## ENSG00000000003.13 4107 9646
## ENSG00000000005.5 0 1
## ENSG00000000419.11 741 1266
## ENSG00000000457.12 312 1317
# Convert the row names of the expression matrix to gene symbols
exp = trans_exp(exp,mrna_only = T)
exp[1:4,1:4]
## TCGA-W5-AA36-01A-11R-A41I-07 TCGA-W5-AA2H-01A-31R-A41I-07
## TSPAN6 2504 226
## TNMD 0 5
## DPM1 1272 1146
## SCYL3 504 602
## TCGA-ZU-A8S4-11A-11R-A41I-07 TCGA-WD-A7RX-01A-12R-A41I-07
## TSPAN6 4107 9646
## TNMD 0 1
## DPM1 741 1266
## SCYL3 312 1317
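To get TPM from the count matrix, see: "Gene length is not end minus start". The reference annotation version used by TCGA is GENCODE v22.
3.2. Convert counts to TPM (each function needs to be run separately)
First compute the effective gene lengths. Because TCGA uses v22 throughout, you do not need to recompute them when switching to another cancer type; the saved result can be reused directly.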
if(F){
  # Run once: compute the non-overlapping exon length of every gene from the v22 GTF
  library(rtracklayer)
  gtf = rtracklayer::import("gencode.v22.annotation.gtf.gz")
  class(gtf)
  gtf = as.data.frame(gtf);dim(gtf)
  table(gtf$type)
  exon = gtf[gtf$type=="exon",
             c("start","end","gene_name")]
  gle = lapply(split(exon,exon$gene_name),function(x){
    # list every base covered by each exon, then count each position once
    tmp = Map(seq, x$start, x$end)
    length(unique(unlist(tmp)))
  })
  gle=data.frame(gene_name=names(gle),
                 length=as.numeric(gle))
  save(gle,file = "v22_gle.Rdata")
}
load("v22_gle.Rdata")
head(gle)
## gene_name length
## 1 5_8S_rRNA 303
## 2 5S_rRNA 2901
## 3 7SK 3562
## 4 A1BG 4006
## 5 A1BG-AS1 2793
## 6 A1CF 9603
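Why the union of exon positions rather than end minus start? Here is a small sketch with made-up coordinates for one hypothetical gene (`exon_toy` is not real annotation). It compares three ways of measuring length; only the last one matches the idea behind the lapply code above.

```r
# Toy exons for one hypothetical gene; coordinates are made up.
# Two transcripts share part of an exon, so the first two intervals overlap.
exon_toy <- data.frame(start = c(1, 51, 301),
                       end   = c(100, 150, 400))

# Gene span (end - start + 1 of the whole locus): includes the intron
span_len <- max(exon_toy$end) - min(exon_toy$start) + 1

# Sum of exon widths: double-counts the overlap at positions 51..100
sum_len <- sum(exon_toy$end - exon_toy$start + 1)

# Union length: every covered base is counted exactly once
positions <- unlist(Map(seq, exon_toy$start, exon_toy$end))
union_len <- length(unique(positions))

c(span = span_len, sum = sum_len, union = union_len)
```

The span overestimates because of the intron, and the plain sum overestimates because of the shared exon region, which is why the positions are deduplicated.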
The gene lengths have to be lined up with the row order of the expression matrix. This uses a very handy base R function: match.
le = gle[match(rownames(exp),gle$gene_name),"length"]
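# This function is a ready-made one; it works in log space.
countToTpm <- function(counts, effLen)
{
  # per-gene rate: counts divided by effective length
  rate <- log(counts) - log(effLen)
  # normalizing constant: the sum of all rates in the sample
  denom <- log(sum(exp(rate)))
  # rescale so that each sample sums to one million
  exp(rate - denom + log(1e6))
}
# apply to every sample (column)
tpms <- apply(exp,2,countToTpm,le)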
tpms[1:3,1:3]
## TCGA-W5-AA36-01A-11R-A41I-07 TCGA-W5-AA2H-01A-31R-A41I-07
## TSPAN6 40.19320 3.8584717
## TNMD 0.00000 0.2404519
## DPM1 76.71414 73.5125551
## TCGA-ZU-A8S4-11A-11R-A41I-07
## TSPAN6 46.52878
## TNMD 0.00000
## DPM1 31.54171
At this point we have the TPM matrix.
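A quick way to convince yourself the conversion is right is to check that every sample sums to one million. The sketch below uses made-up counts and lengths and a plain (non-log) form of the same formula; `tpm_direct`, `counts_toy` and `len_toy` are hypothetical names of mine. It also shows what I would expect to happen if match left an NA in the length vector.

```r
# Toy counts (genes x samples) and gene lengths in bases; all values are made up
counts_toy <- matrix(c(100, 0, 300,
                       50, 20, 10),
                     nrow = 3,
                     dimnames = list(c("g1", "g2", "g3"), c("s1", "s2")))
len_toy <- c(1000, 500, 3000)

# Direct form of TPM: length-normalized rate, rescaled to sum to one million
tpm_direct <- function(counts, len) {
  rate <- counts / len
  rate / sum(rate) * 1e6
}
tpm_toy <- apply(counts_toy, 2, tpm_direct, len_toy)

# Each sample (column) should sum to 1e6
colSums(tpm_toy)

# A gene without a matched length (NA) turns the whole column into NA,
# so it is worth running anyNA() on the length vector before converting
len_bad <- c(1000, NA, 3000)
na_result <- anyNA(apply(counts_toy, 2, tpm_direct, len_bad))
```

On the real data the same check would be `colSums(tpms)` and `anyNA(le)`.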
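3.3. Prepare the input files CIBERSORT requires
This algorithm has not been wrapped into an R package. There is only a script holding the function, CIBERSORT.R; download it and put it in the working directory.
Two input files are needed:
One is the expression matrix file.
The other is LM22.txt, provided on the official website, which records the gene expression signatures of 22 immune cell types.
The file-reading code in CIBERSORT.R is rather blunt, so to accommodate it the row names must be turned into a column before exporting. Otherwise you will get an error later.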
exp2 = as.data.frame(tpms)
exp2 = rownames_to_column(exp2)
write.table(exp2,file = "exp.txt",row.names = F,quote = F,sep = "\t")
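To see what the exported file looks like to a reader that expects gene names in the first column, here is a small round-trip sketch in base R on a made-up matrix. The read.table call is my assumption about how such a script reads its input, so check the corresponding line in your copy of CIBERSORT.R.

```r
# Toy TPM matrix standing in for tpms; values are made up
tpm_toy <- matrix(c(40.2, 0, 76.7, 3.9, 0.2, 73.5),
                  nrow = 3,
                  dimnames = list(c("TSPAN6", "TNMD", "DPM1"),
                                  c("sample-A", "sample-B")))

# Move the gene symbols into the first column (base R equivalent of
# rownames_to_column) and write a tab-separated file
out <- data.frame(rowname = rownames(tpm_toy), tpm_toy, check.names = FALSE)
f <- tempfile(fileext = ".txt")
write.table(out, file = f, row.names = FALSE, quote = FALSE, sep = "\t")

# Read it back with the first column as row names (assumed reading call)
back <- read.table(f, header = TRUE, sep = "\t", row.names = 1,
                   check.names = FALSE)
dim(back)
all.equal(as.matrix(back), tpm_toy)
```

If the row names are not written as a column, the header has one field fewer than the data rows, which is the kind of mismatch that tends to cause the error mentioned above.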
3.4. Run CIBERSORT
source("CIBERSORT.R")
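if(F){
  # perm: number of permutations used for the P-value; QN: quantile normalization.
  # Note: the CIBERSORT authors recommend disabling QN (QN = F) for RNA-seq data;
  # QN = T is kept here to match the saved result loaded below.
  TME.results = CIBERSORT("LM22.txt",
                          "exp.txt" ,
                          perm = 1000,
                          QN = T)
  save(TME.results,file = "ciber_CHOL.Rdata")
}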
load("ciber_CHOL.Rdata")
TME.results[1:4,1:4]
## B cells naive B cells memory Plasma cells
## TCGA-W5-AA36-01A-11R-A41I-07 0.00000000 0.002351185 0.02550133
## TCGA-W5-AA2H-01A-31R-A41I-07 0.04512086 0.354414124 0.01961627
## TCGA-ZU-A8S4-11A-11R-A41I-07 0.00203370 0.000000000 0.04582565
## TCGA-WD-A7RX-01A-12R-A41I-07 0.15785229 0.000000000 0.01847074
## T cells CD8
## TCGA-W5-AA36-01A-11R-A41I-07 0.07766099
## TCGA-W5-AA2H-01A-31R-A41I-07 0.14262301
## TCGA-ZU-A8S4-11A-11R-A41I-07 0.09962641
## TCGA-WD-A7RX-01A-12R-A41I-07 0.13769951
re <- TME.results[,-(23:25)]
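It runs rather slowly. The result contains the abundance of the 22 immune cell types plus three columns of other statistics (P-value, Correlation and RMSE), which we ignore here.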
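Dropping columns 23 to 25 by position works, but selecting by name is a little safer. Below is a minimal sketch on a made-up result with 3 cell types instead of 22. The statistic column names are what I understand CIBERSORT.R to output, so treat them as an assumption and check `colnames(TME.results)` first.

```r
# Toy result with 3 cell types instead of 22; all numbers are made up
res_toy <- cbind(
  "B cells naive" = c(0.2, 0.5),
  "T cells CD8"   = c(0.3, 0.1),
  "Monocytes"     = c(0.5, 0.4),
  "P-value"       = c(0.01, 0.20),
  "Correlation"   = c(0.60, 0.15),
  "RMSE"          = c(0.80, 1.05)
)
rownames(res_toy) <- c("s1", "s2")

# Drop the statistics by name rather than by position
stat_cols <- c("P-value", "Correlation", "RMSE")
frac <- res_toy[, !colnames(res_toy) %in% stat_cols]

# The estimated fractions of each sample should sum to 1
rowSums(frac)

# Optional: keep only samples whose deconvolution P-value is below 0.05
frac_ok <- frac[res_toy[, "P-value"] < 0.05, , drop = FALSE]
```

Whether to filter on the P-value is a judgment call; this post keeps all samples.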
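3.5. The classic immune cell abundance heatmap
Immune cell types whose abundance is 0 in half or more of the samples are not shown in the heatmap. Looking at the heatmap, the clustering does not separate normal and tumor samples well.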
library(pheatmap)
k <- apply(re,2,function(x) {sum(x == 0) < nrow(TME.results)/2})
table(k)
## k
## FALSE TRUE
## 8 14
re2 <- as.data.frame(t(re[,k]))
# Group (tumor/normal label per sample) is assumed to come from TCGA-CHOL_gdc.Rdata
an = data.frame(group = Group,
row.names = colnames(exp))
pheatmap(re2,scale = "row",
show_colnames = F,
annotation_col = an,
color = colorRampPalette(c("navy", "white", "firebrick3"))(50))
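The zero filter used for the heatmap above is easy to misread, so here is the same logic on a made-up 4-sample matrix (`re_toy` and its cell type names are hypothetical). `colMeans(x == 0)` gives the share of samples in which a cell type is exactly zero, which is equivalent to comparing the count of zeros with half the number of samples.

```r
# Toy fraction matrix: 4 samples x 3 cell types, values made up
re_toy <- cbind(
  typeA = c(0.5, 0.4, 0.6, 0.3),
  typeB = c(0.0, 0.0, 0.0, 0.2),   # zero in 3 of 4 samples
  typeC = c(0.5, 0.6, 0.4, 0.5)
)

# Share of samples in which each cell type is exactly 0
zero_rate <- colMeans(re_toy == 0)

# Keep cell types that are zero in fewer than half of the samples
keep <- zero_rate < 0.5
re_kept <- re_toy[, keep, drop = FALSE]
colnames(re_kept)
```

Using `drop = FALSE` keeps the result a matrix even if only one cell type survives the filter.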
3.6. Bar plot
A stacked bar plot shows the immune cell proportions of each sample.
library(RColorBrewer)
mypalette <- colorRampPalette(brewer.pal(8,"Set1"))
dat <- re %>% as.data.frame() %>%
rownames_to_column("Sample") %>%
gather(key = Cell_type,value = Proportion,-Sample)
ggplot(dat,aes(Sample,Proportion,fill = Cell_type)) +
geom_bar(stat = "identity") +
labs(fill = "Cell Type",x = "",y = "Estimated Proportion") +
theme_bw() +
theme(axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
legend.position = "bottom") +
scale_y_continuous(expand = c(0.01,0)) +
scale_fill_manual(values = mypalette(22))
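The plot above needs the data in long format, one row per sample and cell type. If you want to see what that reshaping does without tidyverse, a base R sketch on a made-up 2 x 2 matrix looks like this (the column names simply mirror the ones used above).

```r
# Toy fraction matrix (samples x cell types); values are made up
re_toy <- matrix(c(0.7, 0.2,
                   0.3, 0.8),
                 nrow = 2,
                 dimnames = list(c("s1", "s2"), c("typeA", "typeB")))

# as.table() + as.data.frame() gives one row per (sample, cell type) pair
long <- as.data.frame(as.table(re_toy), stringsAsFactors = FALSE)
colnames(long) <- c("Sample", "Cell_type", "Proportion")
long

# In the stacked bar plot, the height of each bar is the sum per sample
bar_height <- tapply(long$Proportion, long$Sample, sum)
bar_height
```

Since the fractions of a sample sum to 1, every bar in the stacked plot should reach the same height.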
3.7. Box plot
This shows the comparison between immune cell types.
ggplot(dat,aes(Cell_type,Proportion,fill = Cell_type)) +
geom_boxplot(outlier.shape = 21,color = "black") +
theme_bw() +
labs(x = "Cell Type", y = "Estimated Proportion") +
theme(axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
legend.position = "bottom") +
scale_fill_manual(values = mypalette(22))
A bit messy? Then let's give the box plot an order.
a = dat %>%
group_by(Cell_type) %>%
summarise(m = median(Proportion)) %>%
arrange(desc(m)) %>%
pull(Cell_type)
dat$Cell_type = factor(dat$Cell_type,levels = a)
ggplot(dat,aes(Cell_type,Proportion,fill = Cell_type)) +
geom_boxplot(outlier.shape = 21,color = "black") +
theme_bw() +
labs(x = "Cell Type", y = "Estimated Proportion") +
theme(axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
legend.position = "bottom") +
scale_fill_manual(values = mypalette(22))
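The ordering trick is just a matter of factor levels: ggplot2 draws a discrete axis in level order. A base R sketch of the same idea on made-up data (`dat_toy` is hypothetical), sorting cell types by median from high to low:

```r
# Toy long-format data; proportions are made up
dat_toy <- data.frame(
  Cell_type  = rep(c("typeA", "typeB", "typeC"), each = 3),
  Proportion = c(0.10, 0.20, 0.30,
                 0.50, 0.60, 0.70,
                 0.05, 0.02, 0.01)
)

# Median per cell type, sorted from high to low
med <- tapply(dat_toy$Proportion, dat_toy$Cell_type, median)
lev <- names(sort(med, decreasing = TRUE))

# Factor levels control the order used on a discrete axis
dat_toy$Cell_type <- factor(dat_toy$Cell_type, levels = lev)
levels(dat_toy$Cell_type)
```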
Since we have also computed the normal samples, let's make a comparison:
dat$Group = ifelse(as.numeric(str_sub(dat$Sample,14,15))<10,"tumor","normal")
library(ggpubr)
ggplot(dat,aes(Cell_type,Proportion,fill = Group)) +
geom_boxplot(outlier.shape = 21,color = "black") +
theme_bw() +
labs(x = "Cell Type", y = "Estimated Proportion") +
theme(legend.position = "top") +
theme(axis.text.x = element_text(angle=80,vjust = 0.5))+
scale_fill_manual(values = mypalette(22)[c(6,1)])+
stat_compare_means(aes(group = Group,label = ..p.signif..),method = "kruskal.test")
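Two pieces of the code above are worth spelling out: how the group is read from the TCGA barcode, and what kind of test is behind the significance labels. The sketch below uses two barcodes from the matrix shown earlier plus made-up proportions for one hypothetical cell type; it is an illustration, not a reproduction of the ggpubr internals.

```r
# Barcodes taken from the expression matrix shown earlier in this post
barcodes <- c("TCGA-W5-AA36-01A-11R-A41I-07",
              "TCGA-ZU-A8S4-11A-11R-A41I-07")

# Characters 14-15 hold the sample type code: 01-09 tumor, 10-19 normal
code  <- as.numeric(substr(barcodes, 14, 15))
group <- ifelse(code < 10, "tumor", "normal")
group

# Kruskal-Wallis test for one cell type on made-up proportions,
# the same test that method = "kruskal.test" requests in the plot
toy <- data.frame(
  Proportion = c(0.10, 0.12, 0.15, 0.11, 0.30, 0.35, 0.32, 0.40),
  Group      = rep(c("normal", "tumor"), each = 4)
)
p <- kruskal.test(Proportion ~ Group, data = toy)$p.value
p
```

In the plot this test is run once per cell type, so with 22 cell types some correction for multiple testing may be worth considering.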
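Looked at separately, you can indeed see differences. It is just that too many of them are not significant, which is why the heatmap clustered the way it did. It doesn't matter much.
Author: 小潔忘了怎么分身
Link: http://www.reibang.com/p/03a7440c0960
Source: Jianshu (簡書)
Copyright belongs to the author. For commercial reprints, contact the author for authorization; for non-commercial reprints, please credit the source.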