2020-06-15 `Splatter`是一個(gè)用于簡(jiǎn)單模擬scRNA-seq數(shù)據(jù)的R包

splatter

clp

15 June, 2020

Splatter是一個(gè)用于簡(jiǎn)單模擬單細(xì)胞RNA測(cè)序數(shù)據(jù)的R包锉矢，本文概述并介紹Splatter的功能。

1. 安裝R包

2種安裝方法铜靶，1個(gè)是從Bioconductor安裝宦芦，一個(gè)是從GitHub安裝。

#  if (!requireNamespace("BiocManager", quietly=TRUE))
#      install.packages("BiocManager")
#  BiocManager::install("splatter")

#  BiocManager::install("Oshlack/splatter", dependencies = TRUE,
#                       build_vignettes = TRUE)

假設(shè)您已經(jīng)有了一個(gè)與您想要模擬的計(jì)數(shù)數(shù)據(jù)矩陣類似的矩陣呐籽，使用Splatter創(chuàng)建模擬數(shù)據(jù)集有兩個(gè)簡(jiǎn)單的步驟盏浇。以下是使用scater包生成的模擬數(shù)據(jù)集的示例：

2. quickstart

# Load package
library(splatter)
#> Warning: package 'splatter' was built under R version 3.6.2

# Create mock data
library(scater)
#> Warning: package 'ggplot2' was built under R version 3.6.2
set.seed(1)
sce <- mockSCE()

# Estimate parameters from mock data
params <- splatEstimate(sce)

# Simulate data using estimated parameters
sim <- splatSimulate(params)

這些步驟將在以下各節(jié)中詳細(xì)說(shuō)明变丧，但簡(jiǎn)要地說(shuō)，第一步采用數(shù)據(jù)集并從中估計(jì)模擬參數(shù)绢掰，第二步采用這些參數(shù)并模擬新的數(shù)據(jù)集痒蓬。

3. The Splat simulation

在了解如何估計(jì)參數(shù)之前，讓我們先看看Splatter是如何模擬數(shù)據(jù)的滴劲，以及這些參數(shù)是什么攻晒。我們使用術(shù)語(yǔ)Splat來(lái)指代Splatter自己的模擬，并將其與軟件包本身區(qū)分開(kāi)來(lái)班挖。Splat模型的核心是伽馬-泊松分布鲁捏，用于通過(guò)細(xì)胞計(jì)數(shù)矩陣生成基因。從gamma distribution模擬每個(gè)基因的平均表達(dá)水平萧芙，并且在從Poisson distribution模擬計(jì)數(shù)之前给梅，使用生物變異系數(shù)來(lái)強(qiáng)制均值-方差趨勢(shì)假丧。Splat還允許您模擬表達(dá)異常基因(具有伽馬分布之外的平均表達(dá)的基因)和丟棄(dropout)(基于平均表達(dá)式隨機(jī)剔除計(jì)數(shù))动羽。每個(gè)單細(xì)胞都有一個(gè)預(yù)期的庫(kù)大小(從對(duì)數(shù)正態(tài)分布模擬)包帚，這使得它更容易與給定的數(shù)據(jù)集匹配。

Splat還可以模擬不同類型細(xì)胞之間的差異表達(dá)或不同細(xì)胞類型之間的分化路徑运吓，其中表達(dá)以連續(xù)的方式變化渴邦。這些內(nèi)容將在simulating counts一節(jié)中進(jìn)一步介紹。

4. The `SplatParams` object

Splat模擬的所有參數(shù)都存儲(chǔ)在SplatParams對(duì)象中拘哨。讓我們創(chuàng)建一個(gè)新的谋梭，看看它是什么樣子。

params <- newSplatParams()
params
#> A Params object of class SplatParams 
#> Parameters can be (estimable) or [not estimable], 'Default' or  'NOT DEFAULT' 
#> Secondary parameters are usually set during simulation
#> 
#> Global: 
#> (Genes)  (Cells)   [Seed] 
#>   10000      100   712777 
#> 
#> 28 additional parameters 
#> 
#> Batches: 
#>     [Batches]  [Batch Cells]     [Location]        [Scale] 
#>             1            100            0.1            0.1 
#> 
#> Mean: 
#>  (Rate)  (Shape) 
#>     0.3      0.6 
#> 
#> Library size: 
#> (Location)     (Scale)      (Norm) 
#>         11         0.2       FALSE 
#> 
#> Exprs outliers: 
#> (Probability)     (Location)        (Scale) 
#>          0.05              4            0.5 
#> 
#> Groups: 
#>      [Groups]  [Group Probs] 
#>             1              1 
#> 
#> Diff expr: 
#> [Probability]    [Down Prob]     [Location]        [Scale] 
#>           0.1            0.5            0.1            0.4 
#> 
#> BCV: 
#> (Common Disp)          (DoF) 
#>           0.1             60 
#> 
#> Dropout: 
#>     [Type]  (Midpoint)     (Shape) 
#>       none           0          -1 
#> 
#> Paths: 
#>         [From]         [Steps]          [Skew]    [Non-linear]  [Sigma Factor] 
#>              0             100             0.5             0.1             0.8

除了告訴我們有什么類型的對(duì)象(“A Params object of class SplatParams”)并向我們顯示參數(shù)的值之外倦青，此輸出還為我們提供了一些額外的信息瓮床。我們可以看到，哪些參數(shù)可以通過(guò)splatEstimate函數(shù)估計(jì)(those in parentheses)姨夹，哪些參數(shù)不能估計(jì)(those in brackets)纤垂，哪些參數(shù)的默認(rèn)值已經(jīng)改變(those in ALL CAPS)矾策。有關(guān)Splat模擬參數(shù)的更多詳細(xì)信息磷账，請(qǐng)參閱Splat parameters vignette。

4.1 Getting and setting

getParam(params, "nGenes")
#> [1] 10000

或者贾虽，可以使用setParam函數(shù)為參數(shù)賦予新值：

params <- setParam(params, "nGenes", 5000)
getParam(params, "nGenes")
#> [1] 5000

如果要提取多個(gè)參數(shù)(列表)或設(shè)置多個(gè)參數(shù)逃糟，可以使用getParams或setParams函數(shù)：

# Set multiple parameters at once (using a list)
params <- setParams(params, update = list(nGenes = 8000, mean.rate = 0.5))
# Extract multiple parameters as a list
getParams(params, c("nGenes", "mean.rate", "mean.shape"))
#> $nGenes
#> [1] 8000
#> 
#> $mean.rate
#> [1] 0.5
#> 
#> $mean.shape
#> [1] 0.6
# Set multiple parameters at once (using additional arguments)
params <- setParams(params, mean.shape = 0.5, de.prob = 0.2)
params
#> A Params object of class SplatParams 
#> Parameters can be (estimable) or [not estimable], 'Default' or  'NOT DEFAULT' 
#> Secondary parameters are usually set during simulation
#> 
#> Global: 
#> (GENES)  (Cells)   [Seed] 
#>    8000      100   712777 
#> 
#> 28 additional parameters 
#> 
#> Batches: 
#>     [Batches]  [Batch Cells]     [Location]        [Scale] 
#>             1            100            0.1            0.1 
#> 
#> Mean: 
#>  (RATE)  (SHAPE) 
#>     0.5      0.5 
#> 
#> Library size: 
#> (Location)     (Scale)      (Norm) 
#>         11         0.2       FALSE 
#> 
#> Exprs outliers: 
#> (Probability)     (Location)        (Scale) 
#>          0.05              4            0.5 
#> 
#> Groups: 
#>      [Groups]  [Group Probs] 
#>             1              1 
#> 
#> Diff expr: 
#> [PROBABILITY]    [Down Prob]     [Location]        [Scale] 
#>           0.2            0.5            0.1            0.4 
#> 
#> BCV: 
#> (Common Disp)          (DoF) 
#>           0.1             60 
#> 
#> Dropout: 
#>     [Type]  (Midpoint)     (Shape) 
#>       none           0          -1 
#> 
#> Paths: 
#>         [From]         [Steps]          [Skew]    [Non-linear]  [Sigma Factor] 
#>              0             100             0.5             0.1             0.8

帶有已更改的參數(shù)現(xiàn)在顯示在所有大寫字母(ALL CAPS)中，以指示它們已從默認(rèn)值更改蓬豁。我們也可以在調(diào)用newSplatParams時(shí)直接設(shè)置參數(shù)：

params <- newSplatParams(lib.loc = 12, lib.scale = 0.6)
getParams(params, c("lib.loc", "lib.scale"))
#> $lib.loc
#> [1] 12
#> 
#> $lib.scale
#> [1] 0.6

5. Estimating parameters

Splat允許您使用splatEstimate函數(shù)從包含計(jì)數(shù)的數(shù)據(jù)集中估計(jì)它的許多參數(shù)绰咽。

# Get the mock counts matrix
counts <- counts(sce)

# Check that counts is an integer matrix
class(counts)
#> [1] "matrix"
typeof(counts)
#> [1] "double"

# Check the dimensions, each row is a gene, each column is a cell
dim(counts)
#> [1] 2000  200

# Show the first few entries
counts[1:5, 1:5]
#>           Cell_001 Cell_002 Cell_003 Cell_004 Cell_005
#> Gene_0001        0        5        7      276       50
#> Gene_0002       12        0        0        0        0
#> Gene_0003       97      292       58       64      541
#> Gene_0004        0        0        0      170       19
#> Gene_0005      105      123      174      565     1061

params <- splatEstimate(counts)

在這里，我們從計(jì)數(shù)矩陣估計(jì)參數(shù)地粪，但是splatEstimate也可以接受SingleCellExperiment對(duì)象取募。估算過(guò)程包括以下步驟：

1.通過(guò)將伽馬分布擬合到平均表達(dá)式水平來(lái)估計(jì)均值參數(shù)。
2.通過(guò)對(duì)庫(kù)大小擬合對(duì)數(shù)正態(tài)分布來(lái)估計(jì)庫(kù)大小參數(shù)蟆技。
3.通過(guò)確定離群點(diǎn)的數(shù)量并將對(duì)數(shù)正態(tài)分布擬合到它們與中位數(shù)的差值來(lái)估計(jì)表達(dá)矩陣中離群點(diǎn)參數(shù)玩敏。
4.使用edgeR包中的estimateDisp函數(shù)估計(jì)BCV參數(shù)。
5.通過(guò)檢查是否存在dropout并將logistic函數(shù)擬合到均值表達(dá)式與零點(diǎn)比例之間的關(guān)系來(lái)估計(jì)丟失參數(shù)质礼。

有關(guān)評(píng)估程序的更多詳細(xì)信息旺聚，請(qǐng)參閱?splatEstimate。

6. Simulating counts

一旦我們有了一組滿意的參數(shù)眶蕉，我們就可以使用splatSimulate來(lái)模擬計(jì)數(shù)砰粹。如果我們想要對(duì)參數(shù)進(jìn)行小幅調(diào)整，我們可以將它們作為附加參數(shù)提供造挽，或者碱璃，如果我們不提供任何參數(shù)弄痹，則將使用默認(rèn)值：

sim <- splatSimulate(params, nGenes = 1000)
sim
#> class: SingleCellExperiment 
#> dim: 1000 200 
#> metadata(1): Params
#> assays(6): BatchCellMeans BaseCellMeans ... TrueCounts counts
#> rownames(1000): Gene1 Gene2 ... Gene999 Gene1000
#> rowData names(4): Gene BaseGeneMean OutlierFactor GeneMean
#> colnames(200): Cell1 Cell2 ... Cell199 Cell200
#> colData names(3): Cell Batch ExpLibSize
#> reducedDimNames(0):
#> spikeNames(0):
#> altExpNames(0):

查看splatSimulate的輸出可以看出，sim是具有1000個(gè)特征(基因)和200個(gè)樣本(細(xì)胞)的SingleCellExperient對(duì)象嵌器。此對(duì)象的主要部分是包含模擬計(jì)數(shù)(使用counts訪問(wèn))的按樣本特征矩陣界酒，盡管它也可以包含其他表達(dá)式(expression measures)度量，如FPKM或TPM嘴秸。此外毁欣，SingleCellExperient包含每個(gè)細(xì)胞的表型信息(使用colData訪問(wèn))和每個(gè)基因的特征信息(使用rowData訪問(wèn))。Splatter使用這些slots以及assays來(lái)存儲(chǔ)有關(guān)模擬中間值的信息岳掐。

# Access the counts
counts(sim)[1:5, 1:5]
#>       Cell1 Cell2 Cell3 Cell4 Cell5
#> Gene1   142   390   110   188   521
#> Gene2  1473   431  1214   881   401
#> Gene3   753   667   940   379   469
#> Gene4  2136  2003  1557  1223   700
#> Gene5   887   529   139   213  1027
# Information about genes
head(rowData(sim))
#> DataFrame with 6 rows and 4 columns
#>           Gene     BaseGeneMean OutlierFactor         GeneMean
#>       <factor>        <numeric>     <numeric>        <numeric>
#> Gene1    Gene1 149.828982351488             1 149.828982351488
#> Gene2    Gene2 316.265527453104             1 316.265527453104
#> Gene3    Gene3 208.039144740334             1 208.039144740334
#> Gene4    Gene4 713.244884835822             1 713.244884835822
#> Gene5    Gene5 171.193128303906             1 171.193128303906
#> Gene6    Gene6 147.896699333501             1 147.896699333501
# Information about cells
head(colData(sim))
#> DataFrame with 6 rows and 3 columns
#>           Cell       Batch       ExpLibSize
#>       <factor> <character>        <numeric>
#> Cell1    Cell1      Batch1 382540.907555357
#> Cell2    Cell2      Batch1  359403.82489391
#> Cell3    Cell3      Batch1 361406.811639421
#> Cell4    Cell4      Batch1 351539.053237628
#> Cell5    Cell5      Batch1 368536.896292943
#> Cell6    Cell6      Batch1 360324.111883911
# Gene by cell matrices
names(assays(sim))
#> [1] "BatchCellMeans" "BaseCellMeans"  "BCV"            "CellMeans"     
#> [5] "TrueCounts"     "counts"
# Example of cell means matrix
assays(sim)$CellMeans[1:5, 1:5]
#>           Cell1     Cell2     Cell3     Cell4    Cell5
#> Gene1  141.2097  375.1557  102.9034  195.2901 510.1044
#> Gene2 1494.8964  421.8598 1181.7091  844.7958 420.4107
#> Gene3  739.8169  693.3015  986.7745  366.4702 467.8734
#> Gene4 2192.9068 2015.6644 1453.3987 1231.2689 735.8236
#> Gene5  900.2559  573.5933  146.7801  194.2355 983.0344

輸出SingleCellExperient的另一個(gè)(很大)好處是我們可以立即訪問(wèn)其他分析包凭疮，比如scater中的繪圖函數(shù)。例如串述，我們可以制作PCA圖：

# Use scater to calculate logcounts
sim <- logNormCounts(sim)
# Plot PCA
sim <- runPCA(sim)
plotPCA(sim)

注意：您的值和繪圖可能看起來(lái)不同执解，因?yàn)槟M是隨機(jī)的，每次運(yùn)行時(shí)都會(huì)產(chǎn)生不同的結(jié)果纲酗。

有關(guān)SingleCellExperient對(duì)象的更多詳細(xì)信息衰腌，請(qǐng)參閱[vignette] SCE-vignette。有關(guān)您可以使用scater執(zhí)行哪些操作的信息觅赊，請(qǐng)參閱相關(guān)的scater文檔和網(wǎng)站vignette右蕊。
splatSimulate函數(shù)輸出以下有關(guān)模擬的附加信息：

Cell information (colData)
- Cell - Unique cell identifier.
- Group - The group or path the cell belongs to.
- ExpLibSize - The expected library size for that cell.
- Step (paths only) - How far along the path each cell is.
Gene information (rowData)
- Gene - Unique gene identifier.
- BaseGeneMean - The base expression level for that gene.
- OutlierFactor - Expression outlier factor for that gene (1 is not an outlier).
- GeneMean - Expression level after applying outlier factors.
- DEFac[Group] - The differential expression factor for each gene in a particular group (1 is not differentially expressed).
- GeneMean[Group] - Expression level of a gene in a particular group after applying differential expression factors.
Gene by cell information (assays)
- BaseCellMeans - The expression of genes in each cell adjusted for expected library size.
- BCV - The Biological Coefficient of Variation for each gene in each cell.
- CellMeans - The expression level of genes in each cell adjusted for BCV.
- TrueCounts - The simulated counts before dropout.
- Dropout - Logical matrix showing which counts have been dropped in which cells.

Splatter添加的值使用UpperCamelCase命名，將其與scater等包使用的underscore_naming分開(kāi)吮螺。有關(guān)模擬的更多信息饶囚，請(qǐng)參見(jiàn)?splatSimulate。

6.1 Simulating groups

到目前為止鸠补，我們只模擬了單個(gè)細(xì)胞群體萝风，但我們通常感興趣的是研究混合細(xì)胞群體，看看存在什么類型的細(xì)胞紫岩，或者它們之間有什么不同规惰。Splatter可以通過(guò)更改method參數(shù)來(lái)模擬這些情況，這里我們將通過(guò)指定group.pro參數(shù)并將method參數(shù)設(shè)置為"groups"來(lái)模擬兩個(gè)組：

注：我們還將Verbose參數(shù)設(shè)置為FALSE泉蝌，以停止Splatter打印進(jìn)度消息歇万。

sim.groups <- splatSimulate(group.prob = c(0.5, 0.5), method = "groups",
                            verbose = FALSE)
sim.groups <- logNormCounts(sim.groups)
sim.groups <- runPCA(sim.groups)
#> Warning in (function (A, nv = 5, nu = nv, maxit = 1000, work = nv + 7, reorth =
#> TRUE, : You're computing too large a percentage of total singular values, use a
#> standard svd instead.
plotPCA(sim.groups, colour_by = "Group")

image.png

因?yàn)槲覀円呀?jīng)將兩個(gè)組的概率都設(shè)置為0.5，所以我們應(yīng)該在每個(gè)組中獲得大致相等的細(xì)胞數(shù)量(在本例中大約是50個(gè))梨与。如果我們想要不均勻的組堕花，我們可以將group.prob設(shè)置為和為1的任何概率集。

6.2 Simulating paths

另一種通常令人感興趣的情況是分化過(guò)程粥鞋，其中一種細(xì)胞類型正在轉(zhuǎn)變?yōu)榱硪环N細(xì)胞類型缘挽。Splatter通過(guò)模擬兩組之間的一系列步驟并隨機(jī)將每個(gè)細(xì)胞分配到一個(gè)步驟來(lái)近似這一過(guò)程。我們可以使用"paths"方法創(chuàng)建這種模擬。

sim.paths <- splatSimulate(de.prob = 0.2, nGenes = 1000, method = "paths",
                           verbose = FALSE)
sim.paths <- logNormCounts(sim.paths)
sim.paths <- runPCA(sim.paths)
#> Warning in (function (A, nv = 5, nu = nv, maxit = 1000, work = nv + 7, reorth =
#> TRUE, : You're computing too large a percentage of total singular values, use a
#> standard svd instead.
plotPCA(sim.paths, colour_by = "Step")

image.png

這里的顏色代表了每個(gè)細(xì)胞的“step”壕曼，或者它沿著分化路徑走了多遠(yuǎn)苏研。我們可以看到，深色的細(xì)胞與原始細(xì)胞類型更相似腮郊，淺色的細(xì)胞更接近最終的分化細(xì)胞類型摹蘑。通過(guò)設(shè)置額外的參數(shù)，可以模擬更復(fù)雜的過(guò)程(例如轧飞，來(lái)自單個(gè)祖細(xì)胞的多種成熟細(xì)胞類型)衅鹿。

6.3 Batch effects

在任何測(cè)序?qū)嶒?yàn)的分析中，另一個(gè)重要的因素是批次效應(yīng)过咬，這是一組同時(shí)處理的樣本共同存在的技術(shù)差異大渤。我們通過(guò)告訴Splatter每批中有多少個(gè)細(xì)胞來(lái)應(yīng)用批處理效果：

sim.batches <- splatSimulate(batchCells = c(50, 50), verbose = FALSE)
sim.batches <- logNormCounts(sim.batches)
sim.batches <- runPCA(sim.batches)
#> Warning in (function (A, nv = 5, nu = nv, maxit = 1000, work = nv + 7, reorth =
#> TRUE, : You're computing too large a percentage of total singular values, use a
#> standard svd instead.
plotPCA(sim.batches, colour_by = "Batch")

image.png

這看起來(lái)很像我們模擬groups的時(shí)候，那是因?yàn)檫^(guò)程非常相似掸绞。不同之處在于泵三，批量效應(yīng)適用于所有基因，而不僅僅是那些差異表達(dá)的基因衔掸，而且影響通常較小烫幕。通過(guò)組合分組和批次，我們既可以模擬我們不感興趣的不需要的變化(batch)敞映，也可以模擬我們正在尋找的需要的變化(group)：

sim.groups <- splatSimulate(batchCells = c(50, 50), group.prob = c(0.5, 0.5),
                            method = "groups", verbose = FALSE)
sim.groups <- logNormCounts(sim.groups)
sim.groups <- runPCA(sim.groups)
#> Warning in (function (A, nv = 5, nu = nv, maxit = 1000, work = nv + 7, reorth =
#> TRUE, : You're computing too large a percentage of total singular values, use a
#> standard svd instead.
plotPCA(sim.groups, shape_by = "Batch", colour_by = "Group")

image.png

在這里较曼，我們看到組(first component)的影響比批處理效果(second component)強(qiáng)，但是通過(guò)調(diào)整參數(shù)驱显，我們可以使批處理效果占主導(dǎo)地位诗芜。

6.4 Convenience functions

每種Splatter模擬方法都有各自的便捷功能。模擬單個(gè)種群使用splatSimulateSingle()(相當(dāng)于splatSimulate(method = "single"))埃疫，使用splatSimulateGroups()(相當(dāng)于splatSimulate(method = "groups"))或模擬路徑使用splatSimulatePaths()(相當(dāng)于splatSimulate(method = "paths"))。

7. Other simulations

除了它自己的Splat模擬方法孩哑，Splatter包還包含其他已發(fā)表的單細(xì)胞RNA-seq模擬的實(shí)現(xiàn)栓霜，或者包裝其他包中包含的模擬。要查看所有可用的模擬横蜒，請(qǐng)運(yùn)行listSims()函數(shù)：

listSims()
#> Splatter currently contains 14 simulations 
#> 
#> Splat (splat) 
#> DOI: 10.1186/s13059-017-1305-0    GitHub: Oshlack/splatter 
#> The Splat simulation generates means from a gamma distribution, adjusts them for BCV and generates counts from a gamma-poisson. Dropout and batch effects can be optionally added. 
#> 
#> Splat Single (splatSingle) 
#> DOI: 10.1186/s13059-017-1305-0    GitHub: Oshlack/splatter 
#> The Splat simulation with a single population. 
#> 
#> Splat Groups (splatGroups) 
#> DOI: 10.1186/s13059-017-1305-0    GitHub: Oshlack/splatter 
#> The Splat simulation with multiple groups. Each group can have it's own differential expression probability and fold change distribution. 
#> 
#> Splat Paths (splatPaths) 
#> DOI: 10.1186/s13059-017-1305-0    GitHub: Oshlack/splatter 
#> The Splat simulation with differentiation paths. Each path can have it's own length, skew and probability. Genes can change in non-linear ways. 
#> 
#> Kersplat (kersplat) 
#> DOI:      GitHub: Oshlack/splatter 
#> The Kersplat simulation extends the Splat model by adding a gene network, more complex cell structure, doublets and empty cells (Experimental). 
#> 
#> Simple (simple) 
#> DOI: 10.1186/s13059-017-1305-0    GitHub: Oshlack/splatter 
#> A simple simulation with gamma means and negative binomial counts. 
#> 
#> Lun (lun) 
#> DOI: 10.1186/s13059-016-0947-7    GitHub: MarioniLab/Deconvolution2016 
#> Gamma distributed means and negative binomial counts. Cells are given a size factor and differential expression can be simulated with fixed fold changes. 
#> 
#> Lun 2 (lun2) 
#> DOI: 10.1093/biostatistics/kxw055     GitHub: MarioniLab/PlateEffects2016 
#> Negative binomial counts where the means and dispersions have been sampled from a real dataset. The core feature of the Lun 2 simulation is the addition of plate effects. Differential expression can be added between two groups of plates and optionally a zero-inflated negative-binomial can be used. 
#> 
#> scDD (scDD) 
#> DOI: 10.1186/s13059-016-1077-y    GitHub: kdkorthauer/scDD 
#> The scDD simulation samples a given dataset and can simulate differentially expressed and differentially distributed genes between two conditions. 
#> 
#> BASiCS (BASiCS) 
#> DOI: 10.1371/journal.pcbi.1004333     GitHub: catavallejos/BASiCS 
#> The BASiCS simulation is based on a bayesian model used to deconvolve biological and technical variation and includes spike-ins and batch effects. 
#> 
#> mfa (mfa) 
#> DOI: 10.12688/wellcomeopenres.11087.1     GitHub: kieranrcampbell/mfa 
#> The mfa simulation produces a bifurcating pseudotime trajectory. This can optionally include genes with transient changes in expression and added dropout. 
#> 
#> PhenoPath (pheno) 
#> DOI: 10.1101/159913   GitHub: kieranrcampbell/phenopath 
#> The PhenoPath simulation produces a pseudotime trajectory with different types of genes. 
#> 
#> ZINB-WaVE (zinb) 
#> DOI: 10.1101/125112   GitHub: drisso/zinbwave 
#> The ZINB-WaVE simulation simulates counts from a sophisticated zero-inflated negative-binomial distribution including cell and gene-level covariates. 
#> 
#> SparseDC (sparseDC) 
#> DOI: 10.1093/nar/gkx1113      GitHub: cran/SparseDC 
#> The SparseDC simulation simulates a set of clusters across two conditions, where some clusters may be present in only one condition.

每模種擬都有其自己的前綴胳蛮，該前綴提供了與該模擬相關(guān)聯(lián)的函數(shù)的名稱。例如丛晌，簡(jiǎn)單模擬的前綴是simple仅炊，因此它會(huì)將其參數(shù)存儲(chǔ)在一個(gè)SimpleParams對(duì)象中，該對(duì)象可以使用newSimpleParams()創(chuàng)建澎蛛，也可以使用simpleEstimate()從實(shí)際數(shù)據(jù)中估計(jì)抚垄。要使用該模擬來(lái)模擬數(shù)據(jù)，您可以使用simpleSimulate()。每次模擬返回一個(gè)SingleCellExperient對(duì)象呆馁，其中間值與splatSimulate()返回的值類似桐经。有關(guān)每個(gè)模擬的更多詳細(xì)信息，請(qǐng)參閱相應(yīng)的幫助頁(yè)面(例如浙滤，?simpleSimulate了解簡(jiǎn)單模擬的工作原理阴挣，或?lun2Estimate了解Lun 2模擬如何估計(jì)參數(shù)的詳細(xì)信息)或參考相應(yīng)的論文或包。

8. Other expression values

Splatter旨在模擬計(jì)數(shù)數(shù)據(jù)纺腊，但一些分析方法需要其他表達(dá)式值畔咧，特別是長(zhǎng)度歸一化的值，如TPM或FPKM揖膜。scater包具有將這些值添加到一個(gè)SingleCellExperiment對(duì)象中的函數(shù)盒卸，但是它們需要每個(gè)基因的長(zhǎng)度。addGeneLengths函數(shù)可以用來(lái)模擬這些長(zhǎng)度：

sim <- simpleSimulate(verbose = FALSE)
sim <- addGeneLengths(sim)
head(rowData(sim))
#> DataFrame with 6 rows and 3 columns
#>           Gene          GeneMean    Length
#>       <factor>         <numeric> <numeric>
#> Gene1    Gene1  10.5022855005221      2520
#> Gene2    Gene2  1.15678349696842       729
#> Gene3    Gene3  1.52303353341067      6009
#> Gene4    Gene4 0.249127173451236     14184
#> Gene5    Gene5   1.3196168428461      3609
#> Gene6    Gene6  1.11978689263638      2468

然后次氨，我們可以使用scater計(jì)算TPM：

tpm(sim) <- calculateTPM(sim, rowData(sim)$Length)
tpm(sim)[1:5, 1:5]
#>          Cell1     Cell2     Cell3    Cell4     Cell5
#> Gene1 556.8385 954.67814 251.01798 313.8751 771.18856
#> Gene2 213.8749 220.00813 216.92912   0.0000 222.15308
#> Gene3   0.0000  53.38190  26.31741   0.0000   0.00000
#> Gene4   0.0000   0.00000   0.00000   0.0000   0.00000
#> Gene5 172.8066  44.44055 131.45580   0.0000  44.87381

addGeneLengths模擬長(zhǎng)度的默認(rèn)方法是從對(duì)數(shù)正態(tài)分布生成值蔽介，然后四舍五入為整數(shù)長(zhǎng)度。這種分布的參數(shù)是基于人類蛋白質(zhì)編碼基因的煮寡，但如果需要可以調(diào)整(例如虹蓄，對(duì)于其他物種)⌒宜海或者薇组，可以從提供的向量中采樣長(zhǎng)度(有關(guān)詳細(xì)信息和示例，請(qǐng)參閱?addGeneLengths)坐儿。

9. Comparing simulations and real data

在模擬數(shù)據(jù)之后律胀，您可能想要做的一件事是將其與真實(shí)數(shù)據(jù)集進(jìn)行比較，或者比較具有不同參數(shù)或模型的模擬貌矿。Splatter提供了一個(gè)函數(shù)compareSCEs炭菌，旨在簡(jiǎn)化這些比較。顧名思義逛漫，這個(gè)函數(shù)獲取一個(gè)SingleCellExperient對(duì)象列表黑低，組合數(shù)據(jù)集，并生成一些比較它們的曲線圖酌毡。讓我們做兩個(gè)小的模擬克握，看看它們?nèi)绾伪容^。

sim1 <- splatSimulate(nGenes = 1000, batchCells = 20, verbose = FALSE)
sim2 <- simpleSimulate(nGenes = 1000, nCells = 20, verbose = FALSE)
comparison <- compareSCEs(list(Splat = sim1, Simple = sim2))

names(comparison)
#> [1] "RowData" "ColData" "Plots"
names(comparison$Plots)
#> [1] "Means"        "Variances"    "MeanVar"      "LibrarySizes" "ZerosGene"   
#> [6] "ZerosCell"    "MeanZeros"    "VarGeneCor"

返回的列表有三個(gè)項(xiàng)目枷踏。前兩個(gè)是按基因(RowData)和按細(xì)胞(ColData)組合的數(shù)據(jù)集菩暗，第三個(gè)包含一些比較圖(使用ggplot2生成)，例如均值分布圖：

comparison$Plots$Means

image.png

這些只是您可能想要考慮的幾個(gè)曲線圖旭蠕，但是使用返回的數(shù)據(jù)應(yīng)該很容易做出更多的曲線圖停团。例如旷坦，我們可以繪制表達(dá)基因的數(shù)量與文庫(kù)大小的關(guān)系圖：

library("ggplot2")
ggplot(comparison$ColData, aes(x = sum, y = detected, colour = Dataset)) +
    geom_point()

image.png

9.1 Comparing differences

有時(shí)，查看它們之間的差異可能更有趣客蹋，而不是直觀地比較數(shù)據(jù)集塞蹭。我們可以使用diffSCEs函數(shù)來(lái)實(shí)現(xiàn)這一點(diǎn)。與compareSCEs類似讶坯，該函數(shù)接受SingleCellExperient對(duì)象的列表番电，但是現(xiàn)在我們還指定了一個(gè)作為引用。返回了一系列類似的曲線圖辆琅，但它們并沒(méi)有顯示總體分布漱办，而是顯示了與參考數(shù)據(jù)的不同之處。

difference <- diffSCEs(list(Splat = sim1, Simple = sim2), ref = "Simple")
difference$Plots$Means

image.png

我們還得到了一系列可以用來(lái)比較分布的分位數(shù)-分位數(shù)(Q-Q)曲線圖婉烟。

difference$QQPlots$Means

image.png

9.2 Making panels

這些比較中的每一個(gè)都有幾個(gè)圖娩井，可能會(huì)有很多值得關(guān)注的地方。為了簡(jiǎn)化操作似袁，或者為出版物制作圖形嗅辣，可以使用函數(shù)makeCompPanel躏仇、makeDiffPanel和makeOverallPanel伪阶。這些函數(shù)使用cowplot包將繪圖合并到單個(gè)panel中歼捏。panel可能非常大，很難查看(例如而涉，在RStudio的繪圖查看器中)著瓶，因此最好輸出panel并分別查看它們。幸運(yùn)的是啼县，cowplot提供了一個(gè)方便的保存圖像的功能材原。以下是輸出每個(gè)panel的一些建議參數(shù)：

#  # This code is just an example and is not run
#  panel <- makeCompPanel(comparison)
#  cowplot::save_plot("comp_panel.png", panel, nrow = 4, ncol = 3)
#  
#  panel <- makeDiffPanel(difference)
#  cowplot::save_plot("diff_panel.png", panel, nrow = 3, ncol = 5)
#  
#  panel <- makeOverallPanel(comparison, difference)
#  cowplot::save_plot("overall_panel.png", panel, ncol = 4, nrow = 7)

10. Citing Splatter

如果您在工作中使用 Splatter，請(qǐng)引用這篇的論文：

citation("splatter")
#> 
#>   Zappia L, Phipson B, Oshlack A. Splatter: Simulation of single-cell
#>   RNA sequencing data. Genome Biology. 2017;
#>   doi:10.1186/s13059-017-1305-0
#> 
#> A BibTeX entry for LaTeX users is
#> 
#>   @Article{,
#>     author = {Luke Zappia and Belinda Phipson and Alicia Oshlack},
#>     title = {Splatter: simulation of single-cell RNA sequencing data},
#>     journal = {Genome Biology},
#>     year = {2017},
#>     url = {http://dx.doi.org/10.1186/s13059-017-1305-0},
#>     doi = {10.1186/s13059-017-1305-0},
#>   }

Session information

sessionInfo()
#> R version 3.6.1 (2019-07-05)
#> Platform: x86_64-apple-darwin15.6.0 (64-bit)
#> Running under: macOS High Sierra 10.13.6
#> 
#> Matrix products: default
#> BLAS:   /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
#> 
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#> 
#> attached base packages:
#> [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
#> [8] methods   base     
#> 
#> other attached packages:
#>  [1] scater_1.14.1               ggplot2_3.3.1              
#>  [3] splatter_1.10.1             SingleCellExperiment_1.8.0 
#>  [5] SummarizedExperiment_1.16.0 DelayedArray_0.12.0        
#>  [7] BiocParallel_1.20.0         matrixStats_0.56.0         
#>  [9] Biobase_2.46.0              GenomicRanges_1.38.0       
#> [11] GenomeInfoDb_1.22.0         IRanges_2.20.1             
#> [13] S4Vectors_0.24.0            BiocGenerics_0.32.0        
#> 
#> loaded via a namespace (and not attached):
#>  [1] viridis_0.5.1            edgeR_3.28.0             BiocSingular_1.2.0      
#>  [4] viridisLite_0.3.0        splines_3.6.1            DelayedMatrixStats_1.8.0
#>  [7] sp_1.4-2                 GenomeInfoDbData_1.2.2   vipor_0.4.5             
#> [10] yaml_2.2.1               pillar_1.4.4             backports_1.1.7         
#> [13] lattice_0.20-41          glue_1.4.1               limma_3.42.2            
#> [16] digest_0.6.25            RColorBrewer_1.1-2       XVector_0.26.0          
#> [19] checkmate_2.0.0          colorspace_1.4-1         cowplot_1.0.0           
#> [22] htmltools_0.4.0          Matrix_1.2-18            pkgconfig_2.0.3         
#> [25] zlibbioc_1.32.0          purrr_0.3.4              scales_1.1.1            
#> [28] tibble_3.0.1             generics_0.0.2           farver_2.0.3            
#> [31] ellipsis_0.3.1           withr_2.2.0              survival_3.1-12         
#> [34] magrittr_1.5             crayon_1.3.4             evaluate_0.14           
#> [37] MASS_7.3-51.6            beeswarm_0.2.3           tools_3.6.1             
#> [40] fitdistrplus_1.1-1       lifecycle_0.2.0          stringr_1.4.0           
#> [43] munsell_0.5.0            locfit_1.5-9.4           irlba_2.3.3             
#> [46] akima_0.6-2.1            compiler_3.6.1           rsvd_1.0.3              
#> [49] rlang_0.4.6              grid_3.6.1               RCurl_1.98-1.2          
#> [52] BiocNeighbors_1.4.1      bitops_1.0-6             labeling_0.3            
#> [55] rmarkdown_2.2            gtable_0.3.0             R6_2.4.1                
#> [58] gridExtra_2.3            knitr_1.28               dplyr_1.0.0             
#> [61] stringi_1.4.6            ggbeeswarm_0.6.0         Rcpp_1.0.4              
#> [64] vctrs_0.3.1              tidyselect_1.1.0         xfun_0.14

最后編輯于：2020.06.15 18:18:28

?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請(qǐng)聯(lián)系作者

人面猴
序言：七十年代末，一起剝皮案震驚了整個(gè)濱河市瘟裸，隨后出現(xiàn)的幾起案子，更是在濱河造成了極大的恐慌话告，老刑警劉巖，帶你破解...
沈念sama閱讀 216,372評(píng)論 6贊 498
死咒
序言：濱河連續(xù)發(fā)生了三起死亡事件沙郭，死亡現(xiàn)場(chǎng)離奇詭異裳朋，居然都是意外死亡病线，警方通過(guò)查閱死者的電腦和手機(jī)，發(fā)現(xiàn)死者居然都...
沈念sama閱讀 92,368評(píng)論 3贊 392
救了他兩次的神仙讓他今天三更去死
文/潘曉璐我一進(jìn)店門，熙熙樓的掌柜王于貴愁眉苦臉地迎上來(lái)送挑，“玉大人，你說(shuō)我怎么就攤上這事纺裁∷九欤” “怎么了？”我有些...
開(kāi)封第一講書人閱讀 162,415評(píng)論 0贊 353
道士緝兇錄：失蹤的賣姜人
文/不壞的土叔我叫張陵谚殊，是天一觀的道長(zhǎng)蛤铜。經(jīng)常有香客問(wèn)我，道長(zhǎng)剿干，這世上最難降的妖魔是什么虐先？我笑而不...
開(kāi)封第一講書人閱讀 58,157評(píng)論 1贊 292
?港島之戀（遺憾婚禮）
正文為了忘掉前任蛹批，我火速辦了婚禮，結(jié)果婚禮上腐芍，老公的妹妹穿的比我還像新娘猪勇。我一直安慰自己，他們只是感情好泣刹，可當(dāng)我...
茶點(diǎn)故事閱讀 67,171評(píng)論 6贊 388
惡毒庶女頂嫁案：這布局不是一般人想出來(lái)的
文/花漫我一把揭開(kāi)白布椅您。她就那樣靜靜地躺著，像睡著了一般雪隧。火紅的嫁衣襯著肌膚如雪。梳的紋絲不亂的頭發(fā)上藕畔，一...
開(kāi)封第一講書人閱讀 51,125評(píng)論 1贊 297
城市分裂傳說(shuō)
那天庄拇，我揣著相機(jī)與錄音丛忆，去河邊找鬼。笑死熄诡，一個(gè)胖子當(dāng)著我的面吹牛凰浮，可吹牛的內(nèi)容都是我干的。我是一名探鬼主播菜拓，決...
沈念sama閱讀 40,028評(píng)論 3贊 417
雙鴛鴦連環(huán)套：你想象不到人心有多黑
文/蒼蘭香墨我猛地睜開(kāi)眼笛厦，長(zhǎng)吁一口氣：“原來(lái)是場(chǎng)噩夢(mèng)啊……” “哼！你這毒婦竟也來(lái)了贱鄙？” 一聲冷哼從身側(cè)響起姨谷，我...
開(kāi)封第一講書人閱讀 38,887評(píng)論 0贊 274
萬(wàn)榮殺人案實(shí)錄
序言：老撾萬(wàn)榮一對(duì)情侶失蹤梦湘，失蹤者是張志新（化名）和其女友劉穎，沒(méi)想到半個(gè)月后哼拔，有當(dāng)?shù)厝嗽跇?shù)林里發(fā)現(xiàn)了一具尸體瓣颅，經(jīng)...
沈念sama閱讀 45,310評(píng)論 1贊 310
?護(hù)林員之死
正文獨(dú)居荒郊野嶺守林人離奇死亡弄捕，尸身上長(zhǎng)有42處帶血的膿包…… 初始之章·張勛以下內(nèi)容為張勛視角年9月15日...
茶點(diǎn)故事閱讀 37,533評(píng)論 2贊 332
?白月光啟示錄
正文我和宋清朗相戀三年，在試婚紗的時(shí)候發(fā)現(xiàn)自己被綠了穿铆。大學(xué)時(shí)的朋友給我發(fā)了我未婚夫和他白月光在一起吃飯的照片斋荞。...
茶點(diǎn)故事閱讀 39,690評(píng)論 1贊 348
活死人
序言：一個(gè)原本活蹦亂跳的男人離奇死亡平酿，死狀恐怖，靈堂內(nèi)的尸體忽然破棺而出筑辨，到底是詐尸還是另有隱情幸逆，我是刑警寧澤，帶...
沈念sama閱讀 35,411評(píng)論 5贊 343
?日本核電站爆炸內(nèi)幕
正文年R本政府宣布楚昭，位于F島的核電站拍顷，受9級(jí)特大地震影響，放射性物質(zhì)發(fā)生泄漏尿贫。R本人自食惡果不足惜爱沟，卻給世界環(huán)境...
茶點(diǎn)故事閱讀 41,004評(píng)論 3贊 325
男人毒藥：我在死后第九天來(lái)索命
文/蒙蒙一呼伸、第九天我趴在偏房一處隱蔽的房頂上張望。院中可真熱鬧搂根，春花似錦铃辖、人聲如沸。這莊子的主人今日做“春日...
開(kāi)封第一講書人閱讀 31,659評(píng)論 0贊 22
一樁弒父案，背后竟有這般陰謀
文/蒼蘭香墨我抬頭看了看天上的太陽(yáng)芒帕。三九已至丰介，卻和暖如春，著一層夾襖步出監(jiān)牢的瞬間带膀，已是汗流浹背橙垢。一陣腳步聲響...
開(kāi)封第一講書人閱讀 32,812評(píng)論 1贊 268
情欲美人皮
我被黑心中介騙來(lái)泰國(guó)打工钢悲，沒(méi)想到剛下飛機(jī)就差點(diǎn)兒被人妖公主榨干…… 1. 我叫王不留，地道東北人莺琳。一個(gè)月前我還...
沈念sama閱讀 47,693評(píng)論 2贊 368
代替公主和親
正文我出身青樓惭等，卻偏偏與公主長(zhǎng)得像，于是被迫代替她去往敵國(guó)和親琳要。傳聞我的和親對(duì)象是個(gè)殘疾皇子秤茅，可洞房花燭夜當(dāng)晚...
茶點(diǎn)故事閱讀 44,577評(píng)論 2贊 353

2020-06-15 `Splatter`是一個(gè)用于簡(jiǎn)單模擬scRNA-seq數(shù)據(jù)的R包

splatter

clp

15 June, 2020

1. 安裝R包

2. quickstart

3. The Splat simulation

4. The SplatParams object

4.1 Getting and setting

5. Estimating parameters

6. Simulating counts

6.1 Simulating groups

6.2 Simulating paths

6.3 Batch effects

6.4 Convenience functions

7. Other simulations

8. Other expression values

9. Comparing simulations and real data

9.1 Comparing differences

9.2 Making panels

10. Citing Splatter

Session information

4. The `SplatParams` object