Monocle包是一個(gè)強(qiáng)大的單細(xì)胞轉(zhuǎn)錄組數(shù)據(jù)分析工具碍岔,它引入了一個(gè)對(duì)單細(xì)胞進(jìn)行排序的新方法(擬時(shí)序分析),通過利用不同細(xì)胞轉(zhuǎn)錄調(diào)控的異步過程拒迅,將它們沿著與生物過程(如細(xì)胞分化等)相對(duì)應(yīng)的發(fā)育軌跡進(jìn)行排序施符。Monocle通過使用先進(jìn)的機(jī)器學(xué)習(xí)算法(Reversed Graph Embedding反向圖嵌入)從單細(xì)胞基因組數(shù)據(jù)中學(xué)習(xí)顯式主圖來對(duì)不同的單細(xì)胞進(jìn)行排序甲馋,該方法對(duì)于解決那些復(fù)雜的生物學(xué)過程具有很高的魯棒性和準(zhǔn)確性养泡。
Monocle2主要執(zhí)行以下三種類型的分析:
- 對(duì)細(xì)胞進(jìn)行聚類嗜湃,分類和計(jì)數(shù)奈应。
- 構(gòu)建單細(xì)胞發(fā)育分化軌跡。
- 基因差異表達(dá)分析购披。
安裝monocle2及其依賴包
source("http://bioconductor.org/biocLite.R")
biocLite("monocle")
biocLite(c("DDRTree", "pheatmap"))
library(monocle)
monocle2基本分析流程
構(gòu)建CellDataSet對(duì)象存儲(chǔ)表達(dá)數(shù)據(jù)
# 加載樣本的metadata注釋信息
pd <- new("AnnotatedDataFrame", data = sample_sheet)
# 加載基因的metadata注釋信息
fd <- new("AnnotatedDataFrame", data = gene_annotation)
# 使用newCellDataSet函數(shù)構(gòu)建CellDataSet對(duì)象
cds <- newCellDataSet(expr_matrix, phenoData = pd, featureData = fd)
通過marker基因?qū)?xì)胞進(jìn)行分類
cth <- newCellTypeHierarchy()
# 選擇marker基因
MYF5_id <- row.names(subset(fData(cds), gene_short_name == "MYF5"))
ANPEP_id <- row.names(subset(fData(cds), gene_short_name == "ANPEP"))
# 根據(jù)marker基因添加細(xì)胞類型注釋
cth <- addCellType(cth, "Myoblast", classify_func =
function(x) { x[MYF5_id,] >= 1 })
cth <- addCellType(cth, "Fibroblast", classify_func =
function(x) { x[MYF5_id,] < 1 & x[ANPEP_id,] > 1 } )
# 使用classifyCells函數(shù)根據(jù)marker基因?qū)?xì)胞進(jìn)行分類
cds <- classifyCells(cds, cth, 0.1)
對(duì)細(xì)胞進(jìn)行聚類分群
# 使用clusterCells函數(shù)對(duì)細(xì)胞進(jìn)行聚類分群
cds <- clusterCells(cds)
擬時(shí)序分析對(duì)細(xì)胞進(jìn)行排序
disp_table <- dispersionTable(cds)
# 根據(jù)基因的表達(dá)選擇排序的基因
ordering_genes <- subset(disp_table, mean_expression >= 0.1)
cds <- setOrderingFilter(cds, ordering_genes)
# 使用reduceDimension函數(shù)進(jìn)行數(shù)據(jù)降維
cds <- reduceDimension(cds)
# 使用orderCells函數(shù)對(duì)細(xì)胞進(jìn)行排序
cds <- orderCells(cds)
執(zhí)行基因表達(dá)差異分析
# 使用differentialGeneTest函數(shù)進(jìn)行差異分析
diff_test_res <- differentialGeneTest(cds, fullModelFormulaStr = "~Media")
sig_genes <- subset(diff_test_res, qval < 0.1)
數(shù)據(jù)導(dǎo)入與CellDataSet對(duì)象構(gòu)建
Monocle2
使用CellDataSet
對(duì)象存儲(chǔ)單細(xì)胞RNA-seq的基因表達(dá)矩陣及其相應(yīng)的metadata注釋信息钥组,該對(duì)象衍生自Bioconductor中的ExpressionSet類。
CellDataSet
類主要需要以下三個(gè)輸入文件:
- exprs今瀑,表示數(shù)值的基因表達(dá)矩陣,其中行是基因点把,列是細(xì)胞橘荠。
- phenoData,一個(gè)AnnotatedDataFrame對(duì)象郎逃,其中行是細(xì)胞哥童,列是細(xì)胞的屬性(例如細(xì)胞的類型,培養(yǎng)條件褒翰,捕獲的天數(shù)等)贮懈。
- featureData,一個(gè)AnnotatedDataFrame對(duì)象优训,其中行是特征(例如基因)朵你,列是基因的屬性,例如生物型揣非,gc含量等抡医。
其中,這三個(gè)輸入文件必須滿足以下要求:
- 基因表達(dá)矩陣的列數(shù)與phenoData的行數(shù)相同早敬。
- 基因表達(dá)矩陣的行數(shù)與featureData的行數(shù)相同忌傻。
- phenoData對(duì)象的行名稱應(yīng)與表達(dá)矩陣的列名稱相匹配。
- featureData對(duì)象的行名應(yīng)與表達(dá)矩陣的行名相匹配搞监。
- featureData中必須有一個(gè)名為“gene_short_name”的列水孩。
library(monocle)
# 讀取所需的三個(gè)輸入文件
# 基因表達(dá)矩陣
HSMM_expr_matrix <- read.table("fpkm_matrix.txt")
# 細(xì)胞注釋信息
HSMM_sample_sheet <- read.delim("cell_sample_sheet.txt")
# 基因注釋信息
HSMM_gene_annotation <- read.delim("gene_annotations.txt")
# 構(gòu)建CellDataSet對(duì)象
pd <- new("AnnotatedDataFrame", data = HSMM_sample_sheet)
fd <- new("AnnotatedDataFrame", data = HSMM_gene_annotation)
HSMM <- newCellDataSet(as.matrix(HSMM_expr_matrix),
phenoData = pd, featureData = fd)
默認(rèn)情況下,Monocle2假定我們讀取的表達(dá)數(shù)據(jù)是以轉(zhuǎn)錄計(jì)數(shù)的原始count值琐驴,并使用負(fù)二項(xiàng)分布模型來進(jìn)行下游的基因表達(dá)差異分析俘种。
注意:如果我們有原始的UMI counts計(jì)數(shù)矩陣,則在創(chuàng)建CellDataSet對(duì)象之前不應(yīng)該對(duì)其進(jìn)行規(guī)歸一化處理棍矛,也不應(yīng)嘗試將UMI計(jì)數(shù)轉(zhuǎn)換為相對(duì)豐度(如將其轉(zhuǎn)換為FPKM / TPM數(shù)據(jù))安疗。Monocle2將在內(nèi)部執(zhí)行所有必需的標(biāo)準(zhǔn)化步驟,自己提前進(jìn)行歸一化處理可能會(huì)破壞Monocle的一些關(guān)鍵步驟够委。
Monocle2也可以通過importCDS
和exportCDS
函數(shù)將其他格式的數(shù)據(jù)導(dǎo)入為CellDataSet對(duì)象荐类,或?qū)ellDataSet對(duì)象導(dǎo)出為其他格式。
# Where 'data_to_be_imported' can either be a Seurat object or an SCESet.
# 將其他格式的數(shù)據(jù)導(dǎo)入為CellDataSet對(duì)象
importCDS(data_to_be_imported)
# We can set the parameter 'import_all' to TRUE if we'd like to import all the slots from our Seurat object or SCESet. (Default is FALSE or only keep minimal dataset)
importCDS(data_to_be_imported, import_all = TRUE)
lung <- load_lung()
# To convert to Seurat object
# 將CellDataSet對(duì)象導(dǎo)出為Seurat對(duì)象
lung_seurat <- exportCDS(lung, 'Seurat')
# To convert to SCESet
# 將CellDataSet對(duì)象導(dǎo)出為SCESet對(duì)象
lung_SCESet <- exportCDS(lung, 'Scater')
Choosing a distribution for your data
Monocle2既可以使用基因的相對(duì)表達(dá)豐度茁帽,也可以使用基于計(jì)數(shù)的原始count值作為輸入的表達(dá)矩陣玉罐。通常屈嗤,它最適合使用轉(zhuǎn)錄本的原始count值,尤其是UMI數(shù)據(jù)吊输。無論我們使用哪種類型的數(shù)據(jù)饶号,都必須為其指定適當(dāng)?shù)姆植肌PKM / TPM值通常呈對(duì)數(shù)正態(tài)分布季蚂,而UMI或read counts可更好地使用負(fù)二項(xiàng)分布建模茫船。對(duì)于原始的count計(jì)數(shù)數(shù)據(jù),請(qǐng)將負(fù)二項(xiàng)式分布指定為newCellDataSet的expressionFamily參數(shù)扭屁。
#Do not run
HSMM <- newCellDataSet(count_matrix,
phenoData = pd,
featureData = fd,
expressionFamily=negbinomial.size())
以下是一些常用的expressionFamily值及其適用條件:
Notes:Using the
wrong expressionFamily
for your data willlead to bad results, errors from Monocle, or both
. However, if you have FPKM/TPM data, you canstill use negative binomial
if you first convert your relative expression values to transcript counts usingrelative2abs()
. This often leads to much more accurate results than usingtobit()
. See the section onConverting TPM to mRNA Counts
for details.
Working with large data sets
對(duì)于一些大型的單細(xì)胞RNA-seq數(shù)據(jù)集算谈,如具有50,000個(gè)細(xì)胞和25,000+個(gè)基因組成的基因表達(dá)矩陣會(huì)占用大量的內(nèi)存。但是料滥,由于當(dāng)前的測(cè)序技術(shù)通常無法捕獲每個(gè)細(xì)胞中的全部或大多數(shù)mRNA分子然眼,因此許多基因的表達(dá)均會(huì)被檢測(cè)為零。我們通常建議大多數(shù)用戶使用稀疏矩陣sparseMatrices的表達(dá)數(shù)據(jù)葵腹,它會(huì)加快許多計(jì)算的速度高每。
我們可以使用Matrix
包將基因表達(dá)矩陣轉(zhuǎn)換為稀疏矩陣提供給Monocle:
HSMM <- newCellDataSet(as(umi_matrix, "sparseMatrix"),
phenoData = pd,
featureData = fd,
lowerDetectionLimit = 0.5,
expressionFamily = negbinomial.size())
如果我們有10X Genomics產(chǎn)生的counts數(shù)據(jù),可以使用cellrangerRkit
包中的load_cellranger_matrix
函數(shù)直接加載基因表達(dá)矩陣践宴。
cellranger_pipestance_path <- "/path/to/your/pipeline/output/directory"
gbm <- load_cellranger_matrix(cellranger_pipestance_path)
# 獲取基因metadata注釋信息
fd <- fData(gbm)
# The number 2 is picked arbitrarily in the line below.
# Where "2" is placed you should place the column number that corresponds to your featureData's gene short names.
# 將基因注釋信息的第二列命名為gene_short_name
colnames(fd)[2] <- "gene_short_name"
# 構(gòu)建CellDataSet對(duì)象
gbm_cds <- newCellDataSet(exprs(gbm),
phenoData = new("AnnotatedDataFrame", data = pData(gbm)),
featureData = new("AnnotatedDataFrame", data = fd),
lowerDetectionLimit = 0.5,
expressionFamily = negbinomial.size())
Converting TPM/FPKM values into mRNA counts
如果我們得到的基因表達(dá)矩陣是已經(jīng)進(jìn)行了歸一化處理的數(shù)據(jù)(如FPKM或TPM值)鲸匿,我們可以使用relative2abs
函數(shù)在創(chuàng)建CellDataSet對(duì)象之前將其轉(zhuǎn)換為RPC值,如下所示:
pd <- new("AnnotatedDataFrame", data = HSMM_sample_sheet)
fd <- new("AnnotatedDataFrame", data = HSMM_gene_annotation)
# First create a CellDataSet from the relative expression levels
HSMM <- newCellDataSet(as.matrix(HSMM_expr_matrix),
phenoData = pd,
featureData = fd,
lowerDetectionLimit = 0.1,
expressionFamily = tobit(Lower = 0.1))
# Next, use it to estimate RNA counts
rpc_matrix <- relative2abs(HSMM, method = "num_genes")
# Now, make a new CellDataSet using the RNA counts
HSMM <- newCellDataSet(as(as.matrix(rpc_matrix), "sparseMatrix"),
phenoData = pd,
featureData = fd,
lowerDetectionLimit = 0.5,
expressionFamily = negbinomial.size())
變異系數(shù)計(jì)算和數(shù)據(jù)過濾預(yù)處理
Estimate size factors and dispersions
Monocle2使用estimateSizeFactors
和estimateDispersions
兩個(gè)函數(shù)計(jì)算Size factors和"dispersion" values浴井,Size factors可以幫助我們對(duì)不同細(xì)胞中mRNA表達(dá)的差異進(jìn)行歸一化處理晒骇,"dispersion" values將有助于后續(xù)的基因差異表達(dá)分析。
HSMM <- estimateSizeFactors(HSMM)
HSMM <- estimateDispersions(HSMM)
Notes: estimateSizeFactors() and estimateDispersions() will only work, and are only needed, if you are working with a CellDataSet with a negbinomial() or negbinomial.size() expression family.
Filtering low-quality cells
Monocle2使用detectGenes
函數(shù)查看基因的表達(dá)情況磺浙,并根據(jù)基因的表達(dá)預(yù)測(cè)其是否用于后續(xù)的細(xì)胞排序分析洪囤。
HSMM <- detectGenes(HSMM, min_expr = 0.1)
print(head(fData(HSMM)))
gene_short_name biotype num_cells_expressed use_for_ordering
ENSG00000000003.10 TSPAN6 protein_coding 184 FALSE
ENSG00000000005.5 TNMD protein_coding 0 FALSE
ENSG00000000419.8 DPM1 protein_coding 211 FALSE
ENSG00000000457.8 SCYL3 protein_coding 18 FALSE
ENSG00000000460.12 C1orf112 protein_coding 47 TRUE
ENSG00000000938.8 FGR protein_coding 0 FALSE
print(head(pData(HSMM)))
Library Well Hours Media Mapped.Fragments Pseudotime State Size_Factor num_genes_expressed
T0_CT_A01 SCC10013_A01 A01 0 GM 1958074 23.916673 1 1.392811 6850
T0_CT_A03 SCC10013_A03 A03 0 GM 1930722 9.022265 1 1.311607 6947
T0_CT_A05 SCC10013_A05 A05 0 GM 1452623 7.546608 1 1.218922 7019
T0_CT_A06 SCC10013_A06 A06 0 GM 2566325 21.463948 1 1.013981 5560
T0_CT_A07 SCC10013_A07 A07 0 GM 2383438 11.299806 1 1.085580 5998
T0_CT_A08 SCC10013_A08 A08 0 GM 1472238 67.436042 2 1.099878 6055
# 選取特定的基因
expressed_genes <- row.names(subset(fData(HSMM),
num_cells_expressed >= 50))
現(xiàn)在,expressed_genes向量中存儲(chǔ)著至少在50個(gè)細(xì)胞中表達(dá)的基因撕氧,我們將用其對(duì)后續(xù)的細(xì)胞按照生物學(xué)過程的順序進(jìn)行排序的分析瘤缩。
# 對(duì)細(xì)胞進(jìn)行過濾
valid_cells <- row.names(subset(pData(HSMM),
Cells.in.Well == 1 &
Control == FALSE &
Clump == FALSE &
Debris == FALSE &
Mapped.Fragments > 1000000))
HSMM <- HSMM[,valid_cells]
# 根據(jù)基因的表達(dá)對(duì)基因進(jìn)行過濾
pData(HSMM)$Total_mRNAs <- Matrix::colSums(exprs(HSMM))
HSMM <- HSMM[,pData(HSMM)$Total_mRNAs < 1e6]
upper_bound <- 10^(mean(log10(pData(HSMM)$Total_mRNAs)) +
2*sd(log10(pData(HSMM)$Total_mRNAs)))
lower_bound <- 10^(mean(log10(pData(HSMM)$Total_mRNAs)) -
2*sd(log10(pData(HSMM)$Total_mRNAs)))
# 查看基因的表達(dá)分布
qplot(Total_mRNAs, data = pData(HSMM), color = Hours, geom =
"density") +
geom_vline(xintercept = lower_bound) +
geom_vline(xintercept = upper_bound)
# 過濾掉那些極高表達(dá)和極低表達(dá)的細(xì)胞
HSMM <- HSMM[,pData(HSMM)$Total_mRNAs > lower_bound &
pData(HSMM)$Total_mRNAs < upper_bound]
HSMM <- detectGenes(HSMM, min_expr = 0.1)
# Log-transform each value in the expression matrix.
L <- log(exprs(HSMM[expressed_genes,]))
# Standardize each gene, so that they are all on the same scale, Then melt the data with plyr so we can plot it easily
melted_dens_df <- melt(Matrix::t(scale(Matrix::t(L))))
# Plot the distribution of the standardized gene expression values.
qplot(value, geom = "density", data = melted_dens_df) +
stat_function(fun = dnorm, size = 0.5, color = 'red') +
xlab("Standardized log(FPKM)") +
ylab("Density")