Inferring tumour purity and stromal and immune cell admixture from expression data.励稳,NC,3013
單細(xì)胞轉(zhuǎn)錄組是揭示細(xì)胞異質(zhì)性的的有力武器面褐,鑒于腫瘤的異質(zhì)性,這一點(diǎn)在腫瘤樣本中表現(xiàn)尤為突出翘地。所以腫瘤樣本的單細(xì)胞轉(zhuǎn)錄組就不只是無監(jiān)督地分個(gè)群那么簡單蛛倦,基于我們對(duì)腫瘤樣本已經(jīng)積累起來的生物學(xué)背景(如TCGA)年柠,我們可以從更多側(cè)面來反映和說明腫瘤樣本的異質(zhì)性扰楼。
今天給大家介紹一款根據(jù)stromal和immune細(xì)胞比例估算腫瘤純度的工具:ESTIMATE。之前是基于bulk表達(dá)譜來做的划鸽,在簡書已經(jīng)有詳細(xì)的介紹了:
文章解讀:
利用表達(dá)數(shù)據(jù)計(jì)算基質(zhì)打分與免疫打分進(jìn)而預(yù)測腫瘤純度 --- ESTIMATE
代碼實(shí)踐:
使用ESTIMATE來根據(jù)stromal和immune細(xì)胞比例估算腫瘤純度
ESTIMATE (E
stimation of ST
romal and I
mmune cells in MA
lignant T
umor tissues using E
xpression data) is a tool for predicting tumor purity
, and the presence of infiltrating stromal/immune cells in tumor tissues using gene expression data
. ESTIMATE algorithm is based on single sample Gene Set Enrichment Analysis and generates three scores:
- stromal score (that captures the presence of stroma in tumor tissue),
- immune score (that represents the infiltration of immune cells in tumor tissue), and
- estimate score (that infers tumor purity).
可以看到和目前單細(xì)胞轉(zhuǎn)錄組中有些基于富集的細(xì)胞類型定義還是很像的输莺,根據(jù)一個(gè)基因list通過某種規(guī)則(這里是ssGSEA)來對(duì)細(xì)胞打分戚哎,進(jìn)而推斷出細(xì)胞的類型。
所以正如生信技能樹在簡書上所言:
其實(shí)對(duì)大部分使用該包的的文章來說嫂用,需要的反而是該包定義的2個(gè)基因集型凳,stromal 和 immune , 列表是:
StromalSignature estimate DCN PAPPA SFRP4 THBS2 LY86 CXCL14 FOXF1 COL10A1 ACTG2 APBB1IP SH2D1A SULF1 MSR1 C3AR1 FAP PTGIS ITGBL1 BGN CXCL12 ECM2 FCGR2A MS4A4A WISP1 COL1A2 MS4A6A EDNRA VCAM1 GPR124 SCUBE2 AIF1 HEPH LUM PTGER3 RUNX1T1 CDH5 PIK3R5 RAMP3 LDB2 COX7A1 EDIL3 DDR2 FCGR2B LPPR4 COL15A1 AOC3 ITIH3 FMO1 PRKG1 PLXDC1 VSIG4 COL6A3 SGCD COL3A1 F13A1 OLFML1IGSF6 COMP HGF GIMAP5 ABCA6 ITGAM MAF ITM2A CLEC7A ASPN LRRC15 ERG CD86 TRAT1 COL8A2 TCF21 CD93 CD163 GREM1 LMOD1TLR2 ZEB2 C1QB KCNJ8 KDR CD33 RASGRP3 TNFSF4 CCR1 CSF1R BTK MFAP5 MXRA5 ISLR ARHGAP28 ZFPM2 TLR7 ADAM12 OLFML2B ENPP2 CILP SIGLEC1 SPON2 PLXNC1 ADAMTS5 SAMSN1 CH25H COL14A1 EMCN RGS4 PCDH12 RARRES2 CD248 PDGFRB C1QA COL5A3 IGF1 SP140TFEC TNN ATP8B4 ZNF423 FRZB SERPING1 ENPEP CD14 DIO2 FPR1 IL18R1 HDC TXNDC3 PDE2A RSAD2 ITIH5 FASLG MMP3 NOX4 WNT2 LRRC32 CXCL9 ODZ4 FBLN2 EGFL6 IL1B SPON1 CD200
ImmuneSignature estimate LCP2 LSP1 FYB PLEK HCK IL10RA LILRB1 NCKAP1L LAIR1 NCF2 CYBB PTPRC IL7R LAPTM5 CD53 EVI2BSLA ITGB2 GIMAP4 MYO1F HCLS1 MNDA IL2RG CD48 AOAH CCL5 LTB GMFG GIMAP6 GZMK LST1 GPR65 LILRB2 WIPF1 CD37 BIN2 FCER1G IKZF1 TYROBP FGL2 FLI1 IRF8 ARHGAP15 SH2B3 TNFRSF1B DOCK2 CD2 ARHGEF6 CORO1A LY96 LYZ ITGAL TNFAIP3 RNASE6TGFB1 PSTPIP1 CST7 RGS1 FGR SELL MICAL1 TRAF3IP3 ITGA4 MAFB ARHGDIB IL4R RHOH HLA-DPA1 NKG7 NCF4 LPXN ITK SELPLG HLA-DPB1 CD3D CD300A IL2RB ADCY7 PTGER4 SRGN CD247 CCR7 MSN ALOX5AP PTGER2 RAC2 GBP2 VAV1 CLEC2B P2RY14 NFKBIAS100A9 IFI30 MFSD1 RASSF2 TPP1 RHOG CLEC4A GZMB PVRIG S100A8 CASP1 BCL2A1 HLA-E KLRB1 GNLY RAB27A IL18RAP TPST2 EMP3 GMIP LCK IL32 PTPRCAP LGALS9 CCDC69 SAMHD1 TAP1 GBP1 CTSS GZMH ADAM8 GLRX PRF1 CD69 HLA-B HLA-DMA CD74 KLRK1 PTPRE HLA-DRA VNN2 TCIRG1 RABGAP1L CSTA ZAP70 HLA-F HLA-G CD52 CD302 CD27
其實(shí)這R包的函數(shù)本不多:
# 安裝一下
library(utils)
rforge <- "http://r-forge.r-project.org"
install.packages("estimate", repos=rforge, dependencies=TRUE)
在這里我們就不再?estimateScore
來一步一步執(zhí)行示例數(shù)據(jù)了,雖然對(duì)于新手來說這一步往往是不能省略的嘱函。實(shí)在想看就看技能樹的吧:使用ESTIMATE來根據(jù)stromal和immune細(xì)胞比例估算腫瘤純度
我就想甘畅,這么好的工具單細(xì)胞能不能使用呢?我也是表達(dá)譜啊往弓,應(yīng)該是可以的吧疏唾,如果可以,Seurat的數(shù)據(jù)格式可不可以直接做呢函似?
帶著一系列疑問我們來試試槐脏。
無疑,作為表達(dá)譜我是合格的撇寞。關(guān)鍵就是數(shù)據(jù)格式的問題了顿天。我們發(fā)現(xiàn)這個(gè)R包操作的都是路徑,就連表達(dá)譜要求的都是:?estimate::filterCommonGenes
蔑担。如果想傳入一個(gè)Seurat的對(duì)象我們是要改造一個(gè)函數(shù)了牌废。
myfilterCommonGenes <- edit(estimate::filterCommonGenes)
看到這個(gè)函數(shù)的文件讀入是用read.table
的:
function (input.f, output.f, id = c("GeneSymbol", "EntrezID"))
{
stopifnot((is.character(input.f) && length(input.f) == 1 &&
nzchar(input.f)) || (inherits(input.f, "connection") &&
isOpen(input.f, "r")))
stopifnot((is.character(output.f) && length(output.f) ==
1 && nzchar(output.f)))
id <- match.arg(id)
input.df <- read.table(input.f, header = TRUE, row.names = 1,
sep = "\t", quote = "", stringsAsFactors = FALSE)
merged.df <- merge(common_genes, input.df, by.x = id, by.y = "row.names")
rownames(merged.df) <- merged.df$GeneSymbol
merged.df <- merged.df[, -1:-ncol(common_genes)]
print(sprintf("Merged dataset includes %d genes (%d mismatched).",
nrow(merged.df), nrow(common_genes) - nrow(merged.df)))
outputGCT(merged.df, output.f)
}
我想直接把seurat的某個(gè)gene-cell矩陣對(duì)象給它,于是就把輸入改成:
myfilterCommonGenes <- function(input.f, output.f, id = c("GeneSymbol", "EntrezID"))
{
id <- match.arg(id)
input.df <- input.f
merged.df <- merge(common_genes, input.df, by.x = id, by.y = "row.names")
rownames(merged.df) <- merged.df$GeneSymbol
merged.df <- merged.df[, -1:-ncol(common_genes)]
print(sprintf("Merged dataset includes %d genes (%d mismatched).",
nrow(merged.df), nrow(common_genes) - nrow(merged.df)))
outputGCT(merged.df, output.f)
}
為了讓改造后的函數(shù)依然在estimated的環(huán)境之中:
environment(myfilterCommonGenes) <- environment(estimate::filterCommonGenes)
我們來試試钟沛,讀入我們熟悉的pbmc數(shù)據(jù)畔规,注意這里的數(shù)據(jù)僅作為Seurat的演示示例:
pbmc <- readRDS('G:\\Desktop\\Desktop\\RStudio\\single_cell\\filtered_gene_bc_matrices\\hg19pbmc_tutorial.rds')
DefaultAssay(pbmc) <- "RNA"
myfilterCommonGenes(input.f= as.matrix(pbmc@assays$RNA@data), output.f=paste0("matrixgenes.gct"), id="GeneSymbol")
一行提示:
[1] "Merged dataset includes 7295 genes (3117 mismatched)."
同時(shí),在當(dāng)前路徑下生成了matrixgenes.gct
,然后我們啟動(dòng)打分程序:
?estimateScore # 還是建議大家多問
# c("affymetrix", "agilent", "illumina")
estimateScore(paste0("matrixgenes.gct"),paste0("estimate_score.gct"), platform= "affymetrix")
[1] "1 gene set: StromalSignature overlap= 58"
[1] "2 gene set: ImmuneSignature overlap= 141" # 可以看到overlap的基因比較少恨统,我們畢竟是10X的數(shù)據(jù)啊。
我們比較感興趣的就是這個(gè)打分文件estimate_score.gct
,我們來看那看是怎樣的三妈,以及如何把它寫入Seurat對(duì)象中畜埋。
estimate_score <- read.table(paste0("estimate_score.gct"),header = F,skip=2,stringsAsFactors = FALSE)
estimate_score <- t(estimate_score)
rownames(estimate_score) <- estimate_score[,1]
colnames(estimate_score) <- estimate_score[2,]
estimate_score <- estimate_score[-c(1:2),]
estimate_score[1:4,1:4]
讀進(jìn)來的被字符串化了:
Description StromalScore ImmuneScore ESTIMATEScore
AAACATACAACCAC "AAACATACAACCAC" "269.574325901855" "1068.73943272047" "1338.31375862232"
AAACATTGAGCTAC "AAACATTGAGCTAC" "157.640476166265" "1245.7269262377" "1403.36740240396"
AAACATTGATCAGC "AAACATTGATCAGC" "399.842094329677" "1208.89031399533" "1608.732408325"
AAACCGTGCTTCCG "AAACCGTGCTTCCG" "595.490469453362" "1418.44175293577" "2013.93222238914"
所以AddMetaData
要處理一下:
pbmc<-AddMetaData(pbmc,metadata = estimate_score,col.name = colnames(df))
pbmc@meta.data$StromalScore <- as.numeric(as.vector(pbmc@meta.data$StromalScore))
pbmc@meta.data$ImmuneScore <- as.numeric(as.vector(pbmc@meta.data$ImmuneScore))
pbmc@meta.data$TumorPurity <- as.numeric(as.vector(pbmc@meta.data$TumorPurity))
pbmc@meta.data$ESTIMATEScore <- as.numeric(as.vector(pbmc@meta.data$ESTIMATEScore))
可以看到給每個(gè)細(xì)胞都打了分,bulk的一個(gè)樣本是一個(gè)組織畴蒲,可以用腫瘤純度悠鞍,單個(gè)細(xì)胞也還是有的嗎?所以模燥,是不是可以用某一群的平均表達(dá)譜來做呢咖祭?其實(shí)看到這只是根據(jù)基因列表的打分機(jī)制,這也未嘗不可蔫骂。
head(pbmc@meta.data)
orig.ident nCount_RNA nFeature_RNA percent.mt RNA_snn_res.0.5 seurat_clusters Description
AAACATACAACCAC pbmc3k 2419 779 3.0177759 1 1 AAACATACAACCAC
AAACATTGAGCTAC pbmc3k 4903 1352 3.7935958 3 3 AAACATTGAGCTAC
AAACATTGATCAGC pbmc3k 3147 1129 0.8897363 1 1 AAACATTGATCAGC
AAACCGTGCTTCCG pbmc3k 2639 960 1.7430845 2 2 AAACCGTGCTTCCG
AAACCGTGTATGCG pbmc3k 980 521 1.2244898 6 6 AAACCGTGTATGCG
AAACGCACTGGTAC pbmc3k 2163 781 1.6643551 1 1 AAACGCACTGGTAC
StromalScore ImmuneScore ESTIMATEScore TumorPurity
AAACATACAACCAC 269.5743 1068.739 1338.314 0.6956758
AAACATTGAGCTAC 157.6405 1245.727 1403.367 0.6887845
AAACATTGATCAGC 399.8421 1208.890 1608.732 0.6666206
AAACCGTGCTTCCG 595.4905 1418.442 2013.932 0.6211327
AAACCGTGTATGCG 540.3476 1019.445 1559.792 0.6719582
AAACGCACTGGTAC 366.6339 1103.304 1469.938 0.6816675
可視化一下,看看不同分群下么翰,"StromalScore" "ImmuneScore" "ESTIMATEScore" "TumorPurity" 四個(gè)分?jǐn)?shù)的比例:
p1<- RidgePlot(pbmc,features = "StromalScore")
p2<- RidgePlot(pbmc,features = "ImmuneScore")
p3<- RidgePlot(pbmc,features = "TumorPurity")
p4<- RidgePlot(pbmc,features = "ESTIMATEScore")
# p4<- Seurat::CombinePlots(c(p1 ,p2,p3,p4))
library(gridExtra)
grid.arrange(p1 ,p2,p3,p4,ncol = 2 )
就TumorPurity 來繪制熟悉的箱型圖:
ggplot(pbmc@meta.data, aes(x=RNA_snn_res.0.5, y=TumorPurity,fill=RNA_snn_res.0.5)) +
geom_boxplot(notch=TRUE)
嚴(yán)格的話,可以做一下顯著性檢驗(yàn)啊辽旋,這里我們就不過了浩嫌。我比較好奇的是這四個(gè)是什么關(guān)系檐迟?為什么還有TumorPurity ,ESTIMATEScore码耐?
library(GGally)
ggpairs(pbmc@meta.data[,c("StromalScore","ImmuneScore","TumorPurity","ESTIMATEScore","seurat_clusters")],
mapping = aes(color = seurat_clusters))
可以看到TumorPurity 追迟,ESTIMATEScore是負(fù)相關(guān)的關(guān)系,其實(shí)知道一個(gè)就可以了骚腥。結(jié)合這些可視化的結(jié)果可以為我們了解哪些群的腫瘤純度如何敦间,從這個(gè)側(cè)面來解釋細(xì)胞的異質(zhì)性。
那么有沒有其他軟件呢束铭?顯然是有的廓块,一個(gè)可以用的就是Xcell了:https://github.com/dviraran/xCell
xCell is a gene signatures-based method
learned from thousands of pure cell types from various sources. xCell applies a novel technique for reducing associations between closely related cell types. xCell signatures were validated using extensive in-silico simulations and also cytometry immunophenotyping
, and were shown to outperform previous methods. xCell allows researchers to reliably portray the cellular heterogeneity landscape of tissue expression profiles.
還有一個(gè):https://cibersortx.stanford.edu/,需要注冊(cè)纯露。
官網(wǎng):
https://bioinformatics.mdanderson.org/public-software/estimate/