splatter
clp
15 June, 2020
Splatter
是一個(gè)用于簡(jiǎn)單模擬單細(xì)胞RNA測(cè)序數(shù)據(jù)的R包锉矢,本文概述并介紹Splatter的功能。
1. 安裝R包
2種安裝方法铜靶,1個(gè)是從Bioconductor安裝宦芦,一個(gè)是從GitHub安裝。
# if (!requireNamespace("BiocManager", quietly=TRUE))
# install.packages("BiocManager")
# BiocManager::install("splatter")
# BiocManager::install("Oshlack/splatter", dependencies = TRUE,
# build_vignettes = TRUE)
假設(shè)您已經(jīng)有了一個(gè)與您想要模擬的計(jì)數(shù)數(shù)據(jù)矩陣類似的矩陣呐籽,使用Splatter創(chuàng)建模擬數(shù)據(jù)集有兩個(gè)簡(jiǎn)單的步驟盏浇。以下是使用scater
包生成的模擬數(shù)據(jù)集的示例:
2. quickstart
# Load package
library(splatter)
#> Warning: package 'splatter' was built under R version 3.6.2
# Create mock data
library(scater)
#> Warning: package 'ggplot2' was built under R version 3.6.2
set.seed(1)
sce <- mockSCE()
# Estimate parameters from mock data
params <- splatEstimate(sce)
# Simulate data using estimated parameters
sim <- splatSimulate(params)
這些步驟將在以下各節(jié)中詳細(xì)說(shuō)明变丧,但簡(jiǎn)要地說(shuō),第一步采用數(shù)據(jù)集并從中估計(jì)模擬參數(shù)绢掰,第二步采用這些參數(shù)并模擬新的數(shù)據(jù)集痒蓬。
3. The Splat simulation
在了解如何估計(jì)參數(shù)之前,讓我們先看看Splatter是如何模擬數(shù)據(jù)的滴劲,以及這些參數(shù)是什么攻晒。我們使用術(shù)語(yǔ)Splat
來(lái)指代Splatter自己的模擬,并將其與軟件包本身區(qū)分開(kāi)來(lái)班挖。Splat模型的核心是伽馬-泊松分布鲁捏,用于通過(guò)細(xì)胞計(jì)數(shù)矩陣生成基因。從gamma distribution模擬每個(gè)基因的平均表達(dá)水平萧芙,并且在從Poisson distribution模擬計(jì)數(shù)之前给梅,使用生物變異系數(shù)來(lái)強(qiáng)制均值-方差趨勢(shì)假丧。Splat
還允許您模擬表達(dá)異常基因(具有伽馬分布之外的平均表達(dá)的基因)和丟棄(dropout)(基于平均表達(dá)式隨機(jī)剔除計(jì)數(shù))动羽。每個(gè)單細(xì)胞都有一個(gè)預(yù)期的庫(kù)大小(從對(duì)數(shù)正態(tài)分布模擬)包帚,這使得它更容易與給定的數(shù)據(jù)集匹配。
Splat
還可以模擬不同類型細(xì)胞之間的差異表達(dá)或不同細(xì)胞類型之間的分化路徑运吓,其中表達(dá)以連續(xù)的方式變化渴邦。這些內(nèi)容將在simulating counts一節(jié)中進(jìn)一步介紹。
4. The SplatParams
object
Splat
模擬的所有參數(shù)都存儲(chǔ)在SplatParams
對(duì)象中拘哨。讓我們創(chuàng)建一個(gè)新的谋梭,看看它是什么樣子。
params <- newSplatParams()
params
#> A Params object of class SplatParams
#> Parameters can be (estimable) or [not estimable], 'Default' or 'NOT DEFAULT'
#> Secondary parameters are usually set during simulation
#>
#> Global:
#> (Genes) (Cells) [Seed]
#> 10000 100 712777
#>
#> 28 additional parameters
#>
#> Batches:
#> [Batches] [Batch Cells] [Location] [Scale]
#> 1 100 0.1 0.1
#>
#> Mean:
#> (Rate) (Shape)
#> 0.3 0.6
#>
#> Library size:
#> (Location) (Scale) (Norm)
#> 11 0.2 FALSE
#>
#> Exprs outliers:
#> (Probability) (Location) (Scale)
#> 0.05 4 0.5
#>
#> Groups:
#> [Groups] [Group Probs]
#> 1 1
#>
#> Diff expr:
#> [Probability] [Down Prob] [Location] [Scale]
#> 0.1 0.5 0.1 0.4
#>
#> BCV:
#> (Common Disp) (DoF)
#> 0.1 60
#>
#> Dropout:
#> [Type] (Midpoint) (Shape)
#> none 0 -1
#>
#> Paths:
#> [From] [Steps] [Skew] [Non-linear] [Sigma Factor]
#> 0 100 0.5 0.1 0.8
除了告訴我們有什么類型的對(duì)象(“A Params
object of class SplatParams
”)并向我們顯示參數(shù)的值之外倦青,此輸出還為我們提供了一些額外的信息瓮床。我們可以看到,哪些參數(shù)可以通過(guò)splatEstimate
函數(shù)估計(jì)(those in parentheses)姨夹,哪些參數(shù)不能估計(jì)(those in brackets)纤垂,哪些參數(shù)的默認(rèn)值已經(jīng)改變(those in ALL CAPS)矾策。有關(guān)Splat
模擬參數(shù)的更多詳細(xì)信息磷账,請(qǐng)參閱Splat parameters vignette。
4.1 Getting and setting
getParam(params, "nGenes")
#> [1] 10000
或者贾虽,可以使用setParam
函數(shù)為參數(shù)賦予新值:
params <- setParam(params, "nGenes", 5000)
getParam(params, "nGenes")
#> [1] 5000
如果要提取多個(gè)參數(shù)(列表)或設(shè)置多個(gè)參數(shù)逃糟,可以使用getParams
或setParams
函數(shù):
# Set multiple parameters at once (using a list)
params <- setParams(params, update = list(nGenes = 8000, mean.rate = 0.5))
# Extract multiple parameters as a list
getParams(params, c("nGenes", "mean.rate", "mean.shape"))
#> $nGenes
#> [1] 8000
#>
#> $mean.rate
#> [1] 0.5
#>
#> $mean.shape
#> [1] 0.6
# Set multiple parameters at once (using additional arguments)
params <- setParams(params, mean.shape = 0.5, de.prob = 0.2)
params
#> A Params object of class SplatParams
#> Parameters can be (estimable) or [not estimable], 'Default' or 'NOT DEFAULT'
#> Secondary parameters are usually set during simulation
#>
#> Global:
#> (GENES) (Cells) [Seed]
#> 8000 100 712777
#>
#> 28 additional parameters
#>
#> Batches:
#> [Batches] [Batch Cells] [Location] [Scale]
#> 1 100 0.1 0.1
#>
#> Mean:
#> (RATE) (SHAPE)
#> 0.5 0.5
#>
#> Library size:
#> (Location) (Scale) (Norm)
#> 11 0.2 FALSE
#>
#> Exprs outliers:
#> (Probability) (Location) (Scale)
#> 0.05 4 0.5
#>
#> Groups:
#> [Groups] [Group Probs]
#> 1 1
#>
#> Diff expr:
#> [PROBABILITY] [Down Prob] [Location] [Scale]
#> 0.2 0.5 0.1 0.4
#>
#> BCV:
#> (Common Disp) (DoF)
#> 0.1 60
#>
#> Dropout:
#> [Type] (Midpoint) (Shape)
#> none 0 -1
#>
#> Paths:
#> [From] [Steps] [Skew] [Non-linear] [Sigma Factor]
#> 0 100 0.5 0.1 0.8
帶有已更改的參數(shù)現(xiàn)在顯示在所有大寫字母(ALL CAPS)中,以指示它們已從默認(rèn)值更改蓬豁。 我們也可以在調(diào)用newSplatParams
時(shí)直接設(shè)置參數(shù):
params <- newSplatParams(lib.loc = 12, lib.scale = 0.6)
getParams(params, c("lib.loc", "lib.scale"))
#> $lib.loc
#> [1] 12
#>
#> $lib.scale
#> [1] 0.6
5. Estimating parameters
Splat
允許您使用splatEstimate
函數(shù)從包含計(jì)數(shù)的數(shù)據(jù)集中估計(jì)它的許多參數(shù)绰咽。
# Get the mock counts matrix
counts <- counts(sce)
# Check that counts is an integer matrix
class(counts)
#> [1] "matrix"
typeof(counts)
#> [1] "double"
# Check the dimensions, each row is a gene, each column is a cell
dim(counts)
#> [1] 2000 200
# Show the first few entries
counts[1:5, 1:5]
#> Cell_001 Cell_002 Cell_003 Cell_004 Cell_005
#> Gene_0001 0 5 7 276 50
#> Gene_0002 12 0 0 0 0
#> Gene_0003 97 292 58 64 541
#> Gene_0004 0 0 0 170 19
#> Gene_0005 105 123 174 565 1061
params <- splatEstimate(counts)
在這里,我們從計(jì)數(shù)矩陣估計(jì)參數(shù)地粪,但是splatEstimate
也可以接受SingleCellExperiment對(duì)象取募。估算過(guò)程包括以下步驟:
- 1.通過(guò)將伽馬分布擬合到平均表達(dá)式水平來(lái)估計(jì)均值參數(shù)。
- 2.通過(guò)對(duì)庫(kù)大小擬合對(duì)數(shù)正態(tài)分布來(lái)估計(jì)庫(kù)大小參數(shù)蟆技。
- 3.通過(guò)確定離群點(diǎn)的數(shù)量并將對(duì)數(shù)正態(tài)分布擬合到它們與中位數(shù)的差值來(lái)估計(jì)表達(dá)矩陣中離群點(diǎn)參數(shù)玩敏。
- 4.使用
edgeR
包中的estimateDisp
函數(shù)估計(jì)BCV
參數(shù)。 - 5.通過(guò)檢查是否存在dropout并將
logistic
函數(shù)擬合到均值表達(dá)式與零點(diǎn)比例之間的關(guān)系來(lái)估計(jì)丟失參數(shù)质礼。
有關(guān)評(píng)估程序的更多詳細(xì)信息旺聚,請(qǐng)參閱?splatEstimate
。
6. Simulating counts
一旦我們有了一組滿意的參數(shù)眶蕉,我們就可以使用splatSimulate
來(lái)模擬計(jì)數(shù)砰粹。如果我們想要對(duì)參數(shù)進(jìn)行小幅調(diào)整,我們可以將它們作為附加參數(shù)提供造挽,或者碱璃,如果我們不提供任何參數(shù)弄痹,則將使用默認(rèn)值:
sim <- splatSimulate(params, nGenes = 1000)
sim
#> class: SingleCellExperiment
#> dim: 1000 200
#> metadata(1): Params
#> assays(6): BatchCellMeans BaseCellMeans ... TrueCounts counts
#> rownames(1000): Gene1 Gene2 ... Gene999 Gene1000
#> rowData names(4): Gene BaseGeneMean OutlierFactor GeneMean
#> colnames(200): Cell1 Cell2 ... Cell199 Cell200
#> colData names(3): Cell Batch ExpLibSize
#> reducedDimNames(0):
#> spikeNames(0):
#> altExpNames(0):
查看splatSimulate
的輸出可以看出,sim
是具有1000個(gè)特征(基因)和200個(gè)樣本(細(xì)胞)的SingleCellExperient
對(duì)象嵌器。此對(duì)象的主要部分是包含模擬計(jì)數(shù)(使用counts
訪問(wèn))的按樣本特征矩陣界酒,盡管它也可以包含其他表達(dá)式(expression measures)度量,如FPKM或TPM嘴秸。此外毁欣,SingleCellExperient
包含每個(gè)細(xì)胞的表型信息(使用colData
訪問(wèn))和每個(gè)基因的特征信息(使用rowData
訪問(wèn))。Splatter使用這些slots以及assays
來(lái)存儲(chǔ)有關(guān)模擬中間值的信息岳掐。
# Access the counts
counts(sim)[1:5, 1:5]
#> Cell1 Cell2 Cell3 Cell4 Cell5
#> Gene1 142 390 110 188 521
#> Gene2 1473 431 1214 881 401
#> Gene3 753 667 940 379 469
#> Gene4 2136 2003 1557 1223 700
#> Gene5 887 529 139 213 1027
# Information about genes
head(rowData(sim))
#> DataFrame with 6 rows and 4 columns
#> Gene BaseGeneMean OutlierFactor GeneMean
#> <factor> <numeric> <numeric> <numeric>
#> Gene1 Gene1 149.828982351488 1 149.828982351488
#> Gene2 Gene2 316.265527453104 1 316.265527453104
#> Gene3 Gene3 208.039144740334 1 208.039144740334
#> Gene4 Gene4 713.244884835822 1 713.244884835822
#> Gene5 Gene5 171.193128303906 1 171.193128303906
#> Gene6 Gene6 147.896699333501 1 147.896699333501
# Information about cells
head(colData(sim))
#> DataFrame with 6 rows and 3 columns
#> Cell Batch ExpLibSize
#> <factor> <character> <numeric>
#> Cell1 Cell1 Batch1 382540.907555357
#> Cell2 Cell2 Batch1 359403.82489391
#> Cell3 Cell3 Batch1 361406.811639421
#> Cell4 Cell4 Batch1 351539.053237628
#> Cell5 Cell5 Batch1 368536.896292943
#> Cell6 Cell6 Batch1 360324.111883911
# Gene by cell matrices
names(assays(sim))
#> [1] "BatchCellMeans" "BaseCellMeans" "BCV" "CellMeans"
#> [5] "TrueCounts" "counts"
# Example of cell means matrix
assays(sim)$CellMeans[1:5, 1:5]
#> Cell1 Cell2 Cell3 Cell4 Cell5
#> Gene1 141.2097 375.1557 102.9034 195.2901 510.1044
#> Gene2 1494.8964 421.8598 1181.7091 844.7958 420.4107
#> Gene3 739.8169 693.3015 986.7745 366.4702 467.8734
#> Gene4 2192.9068 2015.6644 1453.3987 1231.2689 735.8236
#> Gene5 900.2559 573.5933 146.7801 194.2355 983.0344
輸出SingleCellExperient
的另一個(gè)(很大)好處是我們可以立即訪問(wèn)其他分析包凭疮,比如scater
中的繪圖函數(shù)。例如串述,我們可以制作PCA圖:
# Use scater to calculate logcounts
sim <- logNormCounts(sim)
# Plot PCA
sim <- runPCA(sim)
plotPCA(sim)
注意:您的值和繪圖可能看起來(lái)不同执解,因?yàn)槟M是隨機(jī)的,每次運(yùn)行時(shí)都會(huì)產(chǎn)生不同的結(jié)果纲酗。
有關(guān)SingleCellExperient
對(duì)象的更多詳細(xì)信息衰腌,請(qǐng)參閱[vignette] SCE-vignette。有關(guān)您可以使用scater
執(zhí)行哪些操作的信息觅赊,請(qǐng)參閱相關(guān)的scater
文檔和網(wǎng)站vignette右蕊。
splatSimulate
函數(shù)輸出以下有關(guān)模擬的附加信息:
-
Cell information (
colData
)-
Cell
- Unique cell identifier. -
Group
- The group or path the cell belongs to. -
ExpLibSize
- The expected library size for that cell. -
Step
(paths only) - How far along the path each cell is.
-
-
Gene information (
rowData
)-
Gene
- Unique gene identifier. -
BaseGeneMean
- The base expression level for that gene. -
OutlierFactor
- Expression outlier factor for that gene (1 is not an outlier). -
GeneMean
- Expression level after applying outlier factors. -
DEFac
[Group] - The differential expression factor for each gene in a particular group (1 is not differentially expressed). -
GeneMean
[Group] - Expression level of a gene in a particular group after applying differential expression factors.
-
-
Gene by cell information (
assays
)-
BaseCellMeans
- The expression of genes in each cell adjusted for expected library size. -
BCV
- The Biological Coefficient of Variation for each gene in each cell. -
CellMeans
- The expression level of genes in each cell adjusted for BCV. -
TrueCounts
- The simulated counts before dropout. -
Dropout
- Logical matrix showing which counts have been dropped in which cells.
-
Splatter添加的值使用UpperCamelCase
命名,將其與scater
等包使用的underscore_naming
分開(kāi)吮螺。有關(guān)模擬的更多信息饶囚,請(qǐng)參見(jiàn)?splatSimulate
。
6.1 Simulating groups
到目前為止鸠补,我們只模擬了單個(gè)細(xì)胞群體萝风,但我們通常感興趣的是研究混合細(xì)胞群體,看看存在什么類型的細(xì)胞紫岩,或者它們之間有什么不同规惰。Splatter可以通過(guò)更改method
參數(shù)來(lái)模擬這些情況,這里我們將通過(guò)指定group.pro
參數(shù)并將method
參數(shù)設(shè)置為"groups"
來(lái)模擬兩個(gè)組:
注:我們還將
Verbose
參數(shù)設(shè)置為FALSE
泉蝌,以停止Splatter打印進(jìn)度消息歇万。
sim.groups <- splatSimulate(group.prob = c(0.5, 0.5), method = "groups",
verbose = FALSE)
sim.groups <- logNormCounts(sim.groups)
sim.groups <- runPCA(sim.groups)
#> Warning in (function (A, nv = 5, nu = nv, maxit = 1000, work = nv + 7, reorth =
#> TRUE, : You're computing too large a percentage of total singular values, use a
#> standard svd instead.
plotPCA(sim.groups, colour_by = "Group")
因?yàn)槲覀円呀?jīng)將兩個(gè)組的概率都設(shè)置為0.5,所以我們應(yīng)該在每個(gè)組中獲得大致相等的細(xì)胞數(shù)量(在本例中大約是50個(gè))梨与。如果我們想要不均勻的組堕花,我們可以將group.prob
設(shè)置為和為1的任何概率集。
6.2 Simulating paths
另一種通常令人感興趣的情況是分化過(guò)程粥鞋,其中一種細(xì)胞類型正在轉(zhuǎn)變?yōu)榱硪环N細(xì)胞類型缘挽。Splatter通過(guò)模擬兩組之間的一系列步驟并隨機(jī)將每個(gè)細(xì)胞分配到一個(gè)步驟來(lái)近似這一過(guò)程。我們可以使用"paths"
方法創(chuàng)建這種模擬。
sim.paths <- splatSimulate(de.prob = 0.2, nGenes = 1000, method = "paths",
verbose = FALSE)
sim.paths <- logNormCounts(sim.paths)
sim.paths <- runPCA(sim.paths)
#> Warning in (function (A, nv = 5, nu = nv, maxit = 1000, work = nv + 7, reorth =
#> TRUE, : You're computing too large a percentage of total singular values, use a
#> standard svd instead.
plotPCA(sim.paths, colour_by = "Step")
這里的顏色代表了每個(gè)細(xì)胞的“step”壕曼,或者它沿著分化路徑走了多遠(yuǎn)苏研。我們可以看到,深色的細(xì)胞與原始細(xì)胞類型更相似腮郊,淺色的細(xì)胞更接近最終的分化細(xì)胞類型摹蘑。通過(guò)設(shè)置額外的參數(shù),可以模擬更復(fù)雜的過(guò)程(例如轧飞,來(lái)自單個(gè)祖細(xì)胞的多種成熟細(xì)胞類型)衅鹿。
6.3 Batch effects
在任何測(cè)序?qū)嶒?yàn)的分析中,另一個(gè)重要的因素是批次效應(yīng)过咬,這是一組同時(shí)處理的樣本共同存在的技術(shù)差異大渤。我們通過(guò)告訴Splatter每批中有多少個(gè)細(xì)胞來(lái)應(yīng)用批處理效果:
sim.batches <- splatSimulate(batchCells = c(50, 50), verbose = FALSE)
sim.batches <- logNormCounts(sim.batches)
sim.batches <- runPCA(sim.batches)
#> Warning in (function (A, nv = 5, nu = nv, maxit = 1000, work = nv + 7, reorth =
#> TRUE, : You're computing too large a percentage of total singular values, use a
#> standard svd instead.
plotPCA(sim.batches, colour_by = "Batch")
這看起來(lái)很像我們模擬groups的時(shí)候,那是因?yàn)檫^(guò)程非常相似掸绞。不同之處在于泵三,批量效應(yīng)適用于所有基因,而不僅僅是那些差異表達(dá)的基因衔掸,而且影響通常較小烫幕。通過(guò)組合分組和批次,我們既可以模擬我們不感興趣的不需要的變化(batch)敞映,也可以模擬我們正在尋找的需要的變化(group):
sim.groups <- splatSimulate(batchCells = c(50, 50), group.prob = c(0.5, 0.5),
method = "groups", verbose = FALSE)
sim.groups <- logNormCounts(sim.groups)
sim.groups <- runPCA(sim.groups)
#> Warning in (function (A, nv = 5, nu = nv, maxit = 1000, work = nv + 7, reorth =
#> TRUE, : You're computing too large a percentage of total singular values, use a
#> standard svd instead.
plotPCA(sim.groups, shape_by = "Batch", colour_by = "Group")
在這里较曼,我們看到組(first component)的影響比批處理效果(second component)強(qiáng),但是通過(guò)調(diào)整參數(shù)驱显,我們可以使批處理效果占主導(dǎo)地位诗芜。
6.4 Convenience functions
每種Splatter模擬方法都有各自的便捷功能。模擬單個(gè)種群使用splatSimulateSingle()
(相當(dāng)于splatSimulate(method = "single")
)埃疫,使用splatSimulateGroups()
(相當(dāng)于splatSimulate(method = "groups")
)或模擬路徑使用splatSimulatePaths()
(相當(dāng)于splatSimulate(method = "paths")
)。
7. Other simulations
除了它自己的Splat
模擬方法孩哑,Splatter
包還包含其他已發(fā)表的單細(xì)胞RNA-seq
模擬的實(shí)現(xiàn)栓霜,或者包裝其他包中包含的模擬。要查看所有可用的模擬横蜒,請(qǐng)運(yùn)行listSims()
函數(shù):
listSims()
#> Splatter currently contains 14 simulations
#>
#> Splat (splat)
#> DOI: 10.1186/s13059-017-1305-0 GitHub: Oshlack/splatter
#> The Splat simulation generates means from a gamma distribution, adjusts them for BCV and generates counts from a gamma-poisson. Dropout and batch effects can be optionally added.
#>
#> Splat Single (splatSingle)
#> DOI: 10.1186/s13059-017-1305-0 GitHub: Oshlack/splatter
#> The Splat simulation with a single population.
#>
#> Splat Groups (splatGroups)
#> DOI: 10.1186/s13059-017-1305-0 GitHub: Oshlack/splatter
#> The Splat simulation with multiple groups. Each group can have it's own differential expression probability and fold change distribution.
#>
#> Splat Paths (splatPaths)
#> DOI: 10.1186/s13059-017-1305-0 GitHub: Oshlack/splatter
#> The Splat simulation with differentiation paths. Each path can have it's own length, skew and probability. Genes can change in non-linear ways.
#>
#> Kersplat (kersplat)
#> DOI: GitHub: Oshlack/splatter
#> The Kersplat simulation extends the Splat model by adding a gene network, more complex cell structure, doublets and empty cells (Experimental).
#>
#> Simple (simple)
#> DOI: 10.1186/s13059-017-1305-0 GitHub: Oshlack/splatter
#> A simple simulation with gamma means and negative binomial counts.
#>
#> Lun (lun)
#> DOI: 10.1186/s13059-016-0947-7 GitHub: MarioniLab/Deconvolution2016
#> Gamma distributed means and negative binomial counts. Cells are given a size factor and differential expression can be simulated with fixed fold changes.
#>
#> Lun 2 (lun2)
#> DOI: 10.1093/biostatistics/kxw055 GitHub: MarioniLab/PlateEffects2016
#> Negative binomial counts where the means and dispersions have been sampled from a real dataset. The core feature of the Lun 2 simulation is the addition of plate effects. Differential expression can be added between two groups of plates and optionally a zero-inflated negative-binomial can be used.
#>
#> scDD (scDD)
#> DOI: 10.1186/s13059-016-1077-y GitHub: kdkorthauer/scDD
#> The scDD simulation samples a given dataset and can simulate differentially expressed and differentially distributed genes between two conditions.
#>
#> BASiCS (BASiCS)
#> DOI: 10.1371/journal.pcbi.1004333 GitHub: catavallejos/BASiCS
#> The BASiCS simulation is based on a bayesian model used to deconvolve biological and technical variation and includes spike-ins and batch effects.
#>
#> mfa (mfa)
#> DOI: 10.12688/wellcomeopenres.11087.1 GitHub: kieranrcampbell/mfa
#> The mfa simulation produces a bifurcating pseudotime trajectory. This can optionally include genes with transient changes in expression and added dropout.
#>
#> PhenoPath (pheno)
#> DOI: 10.1101/159913 GitHub: kieranrcampbell/phenopath
#> The PhenoPath simulation produces a pseudotime trajectory with different types of genes.
#>
#> ZINB-WaVE (zinb)
#> DOI: 10.1101/125112 GitHub: drisso/zinbwave
#> The ZINB-WaVE simulation simulates counts from a sophisticated zero-inflated negative-binomial distribution including cell and gene-level covariates.
#>
#> SparseDC (sparseDC)
#> DOI: 10.1093/nar/gkx1113 GitHub: cran/SparseDC
#> The SparseDC simulation simulates a set of clusters across two conditions, where some clusters may be present in only one condition.
每模種擬都有其自己的前綴胳蛮,該前綴提供了與該模擬相關(guān)聯(lián)的函數(shù)的名稱。例如丛晌,簡(jiǎn)單模擬的前綴是simple
仅炊,因此它會(huì)將其參數(shù)存儲(chǔ)在一個(gè)SimpleParams
對(duì)象中,該對(duì)象可以使用newSimpleParams()
創(chuàng)建澎蛛,也可以使用simpleEstimate()
從實(shí)際數(shù)據(jù)中估計(jì)抚垄。要使用該模擬來(lái)模擬數(shù)據(jù),您可以使用simpleSimulate()
。每次模擬返回一個(gè)SingleCellExperient
對(duì)象呆馁,其中間值與splatSimulate()
返回的值類似桐经。有關(guān)每個(gè)模擬的更多詳細(xì)信息,請(qǐng)參閱相應(yīng)的幫助頁(yè)面(例如浙滤,?simpleSimulate
了解簡(jiǎn)單模擬的工作原理阴挣,或?lun2Estimate
了解Lun 2
模擬如何估計(jì)參數(shù)的詳細(xì)信息)或參考相應(yīng)的論文或包。
8. Other expression values
Splatter旨在模擬計(jì)數(shù)數(shù)據(jù)纺腊,但一些分析方法需要其他表達(dá)式值畔咧,特別是長(zhǎng)度歸一化的值,如TPM或FPKM揖膜。scater
包具有將這些值添加到一個(gè)SingleCellExperiment
對(duì)象中的函數(shù)盒卸,但是它們需要每個(gè)基因的長(zhǎng)度。addGeneLengths
函數(shù)可以用來(lái)模擬這些長(zhǎng)度:
sim <- simpleSimulate(verbose = FALSE)
sim <- addGeneLengths(sim)
head(rowData(sim))
#> DataFrame with 6 rows and 3 columns
#> Gene GeneMean Length
#> <factor> <numeric> <numeric>
#> Gene1 Gene1 10.5022855005221 2520
#> Gene2 Gene2 1.15678349696842 729
#> Gene3 Gene3 1.52303353341067 6009
#> Gene4 Gene4 0.249127173451236 14184
#> Gene5 Gene5 1.3196168428461 3609
#> Gene6 Gene6 1.11978689263638 2468
然后次氨,我們可以使用scater
計(jì)算TPM
:
tpm(sim) <- calculateTPM(sim, rowData(sim)$Length)
tpm(sim)[1:5, 1:5]
#> Cell1 Cell2 Cell3 Cell4 Cell5
#> Gene1 556.8385 954.67814 251.01798 313.8751 771.18856
#> Gene2 213.8749 220.00813 216.92912 0.0000 222.15308
#> Gene3 0.0000 53.38190 26.31741 0.0000 0.00000
#> Gene4 0.0000 0.00000 0.00000 0.0000 0.00000
#> Gene5 172.8066 44.44055 131.45580 0.0000 44.87381
addGeneLengths
模擬長(zhǎng)度的默認(rèn)方法是從對(duì)數(shù)正態(tài)分布生成值蔽介,然后四舍五入為整數(shù)長(zhǎng)度。這種分布的參數(shù)是基于人類蛋白質(zhì)編碼基因的煮寡,但如果需要可以調(diào)整(例如虹蓄,對(duì)于其他物種)⌒宜海或者薇组,可以從提供的向量中采樣長(zhǎng)度(有關(guān)詳細(xì)信息和示例,請(qǐng)參閱?addGeneLengths
)坐儿。
9. Comparing simulations and real data
在模擬數(shù)據(jù)之后律胀,您可能想要做的一件事是將其與真實(shí)數(shù)據(jù)集進(jìn)行比較,或者比較具有不同參數(shù)或模型的模擬貌矿。Splatter提供了一個(gè)函數(shù)compareSCEs
炭菌,旨在簡(jiǎn)化這些比較。顧名思義逛漫,這個(gè)函數(shù)獲取一個(gè)SingleCellExperient
對(duì)象列表黑低,組合數(shù)據(jù)集,并生成一些比較它們的曲線圖酌毡。讓我們做兩個(gè)小的模擬克握,看看它們?nèi)绾伪容^。
sim1 <- splatSimulate(nGenes = 1000, batchCells = 20, verbose = FALSE)
sim2 <- simpleSimulate(nGenes = 1000, nCells = 20, verbose = FALSE)
comparison <- compareSCEs(list(Splat = sim1, Simple = sim2))
names(comparison)
#> [1] "RowData" "ColData" "Plots"
names(comparison$Plots)
#> [1] "Means" "Variances" "MeanVar" "LibrarySizes" "ZerosGene"
#> [6] "ZerosCell" "MeanZeros" "VarGeneCor"
返回的列表有三個(gè)項(xiàng)目枷踏。前兩個(gè)是按基因(RowData
)和按細(xì)胞(ColData
)組合的數(shù)據(jù)集菩暗,第三個(gè)包含一些比較圖(使用ggplot2
生成),例如均值分布圖:
comparison$Plots$Means
這些只是您可能想要考慮的幾個(gè)曲線圖旭蠕,但是使用返回的數(shù)據(jù)應(yīng)該很容易做出更多的曲線圖停团。例如旷坦,我們可以繪制表達(dá)基因的數(shù)量與文庫(kù)大小的關(guān)系圖:
library("ggplot2")
ggplot(comparison$ColData, aes(x = sum, y = detected, colour = Dataset)) +
geom_point()
9.1 Comparing differences
有時(shí),查看它們之間的差異可能更有趣客蹋,而不是直觀地比較數(shù)據(jù)集塞蹭。我們可以使用diffSCEs
函數(shù)來(lái)實(shí)現(xiàn)這一點(diǎn)。與compareSCEs
類似讶坯,該函數(shù)接受SingleCellExperient
對(duì)象的列表番电,但是現(xiàn)在我們還指定了一個(gè)作為引用。返回了一系列類似的曲線圖辆琅,但它們并沒(méi)有顯示總體分布漱办,而是顯示了與參考數(shù)據(jù)的不同之處。
difference <- diffSCEs(list(Splat = sim1, Simple = sim2), ref = "Simple")
difference$Plots$Means
我們還得到了一系列可以用來(lái)比較分布的分位數(shù)-分位數(shù)(Q-Q)曲線圖婉烟。
difference$QQPlots$Means
9.2 Making panels
這些比較中的每一個(gè)都有幾個(gè)圖娩井,可能會(huì)有很多值得關(guān)注的地方。為了簡(jiǎn)化操作似袁,或者為出版物制作圖形嗅辣,可以使用函數(shù)makeCompPanel
躏仇、makeDiffPanel
和makeOverallPanel
伪阶。 這些函數(shù)使用cowplot
包將繪圖合并到單個(gè)panel中歼捏。panel可能非常大,很難查看(例如而涉,在RStudio的繪圖查看器中)著瓶,因此最好輸出panel并分別查看它們。幸運(yùn)的是啼县,cowplot
提供了一個(gè)方便的保存圖像的功能材原。以下是輸出每個(gè)panel的一些建議參數(shù):
# # This code is just an example and is not run
# panel <- makeCompPanel(comparison)
# cowplot::save_plot("comp_panel.png", panel, nrow = 4, ncol = 3)
#
# panel <- makeDiffPanel(difference)
# cowplot::save_plot("diff_panel.png", panel, nrow = 3, ncol = 5)
#
# panel <- makeOverallPanel(comparison, difference)
# cowplot::save_plot("overall_panel.png", panel, ncol = 4, nrow = 7)
10. Citing Splatter
如果您在工作中使用 Splatter,請(qǐng)引用這篇的論文:
citation("splatter")
#>
#> Zappia L, Phipson B, Oshlack A. Splatter: Simulation of single-cell
#> RNA sequencing data. Genome Biology. 2017;
#> doi:10.1186/s13059-017-1305-0
#>
#> A BibTeX entry for LaTeX users is
#>
#> @Article{,
#> author = {Luke Zappia and Belinda Phipson and Alicia Oshlack},
#> title = {Splatter: simulation of single-cell RNA sequencing data},
#> journal = {Genome Biology},
#> year = {2017},
#> url = {http://dx.doi.org/10.1186/s13059-017-1305-0},
#> doi = {10.1186/s13059-017-1305-0},
#> }
Session information
sessionInfo()
#> R version 3.6.1 (2019-07-05)
#> Platform: x86_64-apple-darwin15.6.0 (64-bit)
#> Running under: macOS High Sierra 10.13.6
#>
#> Matrix products: default
#> BLAS: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
#>
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#>
#> attached base packages:
#> [1] parallel stats4 stats graphics grDevices utils datasets
#> [8] methods base
#>
#> other attached packages:
#> [1] scater_1.14.1 ggplot2_3.3.1
#> [3] splatter_1.10.1 SingleCellExperiment_1.8.0
#> [5] SummarizedExperiment_1.16.0 DelayedArray_0.12.0
#> [7] BiocParallel_1.20.0 matrixStats_0.56.0
#> [9] Biobase_2.46.0 GenomicRanges_1.38.0
#> [11] GenomeInfoDb_1.22.0 IRanges_2.20.1
#> [13] S4Vectors_0.24.0 BiocGenerics_0.32.0
#>
#> loaded via a namespace (and not attached):
#> [1] viridis_0.5.1 edgeR_3.28.0 BiocSingular_1.2.0
#> [4] viridisLite_0.3.0 splines_3.6.1 DelayedMatrixStats_1.8.0
#> [7] sp_1.4-2 GenomeInfoDbData_1.2.2 vipor_0.4.5
#> [10] yaml_2.2.1 pillar_1.4.4 backports_1.1.7
#> [13] lattice_0.20-41 glue_1.4.1 limma_3.42.2
#> [16] digest_0.6.25 RColorBrewer_1.1-2 XVector_0.26.0
#> [19] checkmate_2.0.0 colorspace_1.4-1 cowplot_1.0.0
#> [22] htmltools_0.4.0 Matrix_1.2-18 pkgconfig_2.0.3
#> [25] zlibbioc_1.32.0 purrr_0.3.4 scales_1.1.1
#> [28] tibble_3.0.1 generics_0.0.2 farver_2.0.3
#> [31] ellipsis_0.3.1 withr_2.2.0 survival_3.1-12
#> [34] magrittr_1.5 crayon_1.3.4 evaluate_0.14
#> [37] MASS_7.3-51.6 beeswarm_0.2.3 tools_3.6.1
#> [40] fitdistrplus_1.1-1 lifecycle_0.2.0 stringr_1.4.0
#> [43] munsell_0.5.0 locfit_1.5-9.4 irlba_2.3.3
#> [46] akima_0.6-2.1 compiler_3.6.1 rsvd_1.0.3
#> [49] rlang_0.4.6 grid_3.6.1 RCurl_1.98-1.2
#> [52] BiocNeighbors_1.4.1 bitops_1.0-6 labeling_0.3
#> [55] rmarkdown_2.2 gtable_0.3.0 R6_2.4.1
#> [58] gridExtra_2.3 knitr_1.28 dplyr_1.0.0
#> [61] stringi_1.4.6 ggbeeswarm_0.6.0 Rcpp_1.0.4
#> [64] vctrs_0.3.1 tidyselect_1.1.0 xfun_0.14