2020-06-15 `Splatter`是一個(gè)用于簡(jiǎn)單模擬scRNA-seq數(shù)據(jù)的R包

splatter

clp

15 June, 2020

Splatter是一個(gè)用于簡(jiǎn)單模擬單細(xì)胞RNA測(cè)序數(shù)據(jù)的R包锉矢,本文概述并介紹Splatter的功能。

1. 安裝R包

2種安裝方法铜靶,1個(gè)是從Bioconductor安裝宦芦,一個(gè)是從GitHub安裝。

#  if (!requireNamespace("BiocManager", quietly=TRUE))
#      install.packages("BiocManager")
#  BiocManager::install("splatter")

#  BiocManager::install("Oshlack/splatter", dependencies = TRUE,
#                       build_vignettes = TRUE)

假設(shè)您已經(jīng)有了一個(gè)與您想要模擬的計(jì)數(shù)數(shù)據(jù)矩陣類似的矩陣呐籽,使用Splatter創(chuàng)建模擬數(shù)據(jù)集有兩個(gè)簡(jiǎn)單的步驟盏浇。以下是使用scater包生成的模擬數(shù)據(jù)集的示例:

2. quickstart

# Load package
library(splatter)
#> Warning: package 'splatter' was built under R version 3.6.2

# Create mock data
library(scater)
#> Warning: package 'ggplot2' was built under R version 3.6.2
set.seed(1)
sce <- mockSCE()

# Estimate parameters from mock data
params <- splatEstimate(sce)

# Simulate data using estimated parameters
sim <- splatSimulate(params)

這些步驟將在以下各節(jié)中詳細(xì)說(shuō)明变丧,但簡(jiǎn)要地說(shuō),第一步采用數(shù)據(jù)集并從中估計(jì)模擬參數(shù)绢掰,第二步采用這些參數(shù)并模擬新的數(shù)據(jù)集痒蓬。

3. The Splat simulation

在了解如何估計(jì)參數(shù)之前,讓我們先看看Splatter是如何模擬數(shù)據(jù)的滴劲,以及這些參數(shù)是什么攻晒。我們使用術(shù)語(yǔ)Splat來(lái)指代Splatter自己的模擬,并將其與軟件包本身區(qū)分開(kāi)來(lái)班挖。Splat模型的核心是伽馬-泊松分布鲁捏,用于通過(guò)細(xì)胞計(jì)數(shù)矩陣生成基因。從gamma distribution模擬每個(gè)基因的平均表達(dá)水平萧芙,并且在從Poisson distribution模擬計(jì)數(shù)之前给梅,使用生物變異系數(shù)來(lái)強(qiáng)制均值-方差趨勢(shì)假丧。Splat還允許您模擬表達(dá)異常基因(具有伽馬分布之外的平均表達(dá)的基因)和丟棄(dropout)(基于平均表達(dá)式隨機(jī)剔除計(jì)數(shù))动羽。每個(gè)單細(xì)胞都有一個(gè)預(yù)期的庫(kù)大小(從對(duì)數(shù)正態(tài)分布模擬)包帚,這使得它更容易與給定的數(shù)據(jù)集匹配。

Splat還可以模擬不同類型細(xì)胞之間的差異表達(dá)或不同細(xì)胞類型之間的分化路徑运吓,其中表達(dá)以連續(xù)的方式變化渴邦。這些內(nèi)容將在simulating counts一節(jié)中進(jìn)一步介紹。

4. The SplatParams object

Splat模擬的所有參數(shù)都存儲(chǔ)在SplatParams對(duì)象中拘哨。讓我們創(chuàng)建一個(gè)新的谋梭,看看它是什么樣子。

params <- newSplatParams()
params
#> A Params object of class SplatParams 
#> Parameters can be (estimable) or [not estimable], 'Default' or  'NOT DEFAULT' 
#> Secondary parameters are usually set during simulation
#> 
#> Global: 
#> (Genes)  (Cells)   [Seed] 
#>   10000      100   712777 
#> 
#> 28 additional parameters 
#> 
#> Batches: 
#>     [Batches]  [Batch Cells]     [Location]        [Scale] 
#>             1            100            0.1            0.1 
#> 
#> Mean: 
#>  (Rate)  (Shape) 
#>     0.3      0.6 
#> 
#> Library size: 
#> (Location)     (Scale)      (Norm) 
#>         11         0.2       FALSE 
#> 
#> Exprs outliers: 
#> (Probability)     (Location)        (Scale) 
#>          0.05              4            0.5 
#> 
#> Groups: 
#>      [Groups]  [Group Probs] 
#>             1              1 
#> 
#> Diff expr: 
#> [Probability]    [Down Prob]     [Location]        [Scale] 
#>           0.1            0.5            0.1            0.4 
#> 
#> BCV: 
#> (Common Disp)          (DoF) 
#>           0.1             60 
#> 
#> Dropout: 
#>     [Type]  (Midpoint)     (Shape) 
#>       none           0          -1 
#> 
#> Paths: 
#>         [From]         [Steps]          [Skew]    [Non-linear]  [Sigma Factor] 
#>              0             100             0.5             0.1             0.8

除了告訴我們有什么類型的對(duì)象(“A Params object of class SplatParams”)并向我們顯示參數(shù)的值之外倦青,此輸出還為我們提供了一些額外的信息瓮床。我們可以看到,哪些參數(shù)可以通過(guò)splatEstimate函數(shù)估計(jì)(those in parentheses)姨夹,哪些參數(shù)不能估計(jì)(those in brackets)纤垂,哪些參數(shù)的默認(rèn)值已經(jīng)改變(those in ALL CAPS)矾策。有關(guān)Splat模擬參數(shù)的更多詳細(xì)信息磷账,請(qǐng)參閱Splat parameters vignette

4.1 Getting and setting

getParam(params, "nGenes")
#> [1] 10000

或者贾虽,可以使用setParam函數(shù)為參數(shù)賦予新值:

params <- setParam(params, "nGenes", 5000)
getParam(params, "nGenes")
#> [1] 5000

如果要提取多個(gè)參數(shù)(列表)或設(shè)置多個(gè)參數(shù)逃糟,可以使用getParamssetParams函數(shù):

# Set multiple parameters at once (using a list)
params <- setParams(params, update = list(nGenes = 8000, mean.rate = 0.5))
# Extract multiple parameters as a list
getParams(params, c("nGenes", "mean.rate", "mean.shape"))
#> $nGenes
#> [1] 8000
#> 
#> $mean.rate
#> [1] 0.5
#> 
#> $mean.shape
#> [1] 0.6
# Set multiple parameters at once (using additional arguments)
params <- setParams(params, mean.shape = 0.5, de.prob = 0.2)
params
#> A Params object of class SplatParams 
#> Parameters can be (estimable) or [not estimable], 'Default' or  'NOT DEFAULT' 
#> Secondary parameters are usually set during simulation
#> 
#> Global: 
#> (GENES)  (Cells)   [Seed] 
#>    8000      100   712777 
#> 
#> 28 additional parameters 
#> 
#> Batches: 
#>     [Batches]  [Batch Cells]     [Location]        [Scale] 
#>             1            100            0.1            0.1 
#> 
#> Mean: 
#>  (RATE)  (SHAPE) 
#>     0.5      0.5 
#> 
#> Library size: 
#> (Location)     (Scale)      (Norm) 
#>         11         0.2       FALSE 
#> 
#> Exprs outliers: 
#> (Probability)     (Location)        (Scale) 
#>          0.05              4            0.5 
#> 
#> Groups: 
#>      [Groups]  [Group Probs] 
#>             1              1 
#> 
#> Diff expr: 
#> [PROBABILITY]    [Down Prob]     [Location]        [Scale] 
#>           0.2            0.5            0.1            0.4 
#> 
#> BCV: 
#> (Common Disp)          (DoF) 
#>           0.1             60 
#> 
#> Dropout: 
#>     [Type]  (Midpoint)     (Shape) 
#>       none           0          -1 
#> 
#> Paths: 
#>         [From]         [Steps]          [Skew]    [Non-linear]  [Sigma Factor] 
#>              0             100             0.5             0.1             0.8

帶有已更改的參數(shù)現(xiàn)在顯示在所有大寫字母(ALL CAPS)中,以指示它們已從默認(rèn)值更改蓬豁。 我們也可以在調(diào)用newSplatParams時(shí)直接設(shè)置參數(shù):

params <- newSplatParams(lib.loc = 12, lib.scale = 0.6)
getParams(params, c("lib.loc", "lib.scale"))
#> $lib.loc
#> [1] 12
#> 
#> $lib.scale
#> [1] 0.6

5. Estimating parameters

Splat允許您使用splatEstimate函數(shù)從包含計(jì)數(shù)的數(shù)據(jù)集中估計(jì)它的許多參數(shù)绰咽。

# Get the mock counts matrix
counts <- counts(sce)

# Check that counts is an integer matrix
class(counts)
#> [1] "matrix"
typeof(counts)
#> [1] "double"

# Check the dimensions, each row is a gene, each column is a cell
dim(counts)
#> [1] 2000  200

# Show the first few entries
counts[1:5, 1:5]
#>           Cell_001 Cell_002 Cell_003 Cell_004 Cell_005
#> Gene_0001        0        5        7      276       50
#> Gene_0002       12        0        0        0        0
#> Gene_0003       97      292       58       64      541
#> Gene_0004        0        0        0      170       19
#> Gene_0005      105      123      174      565     1061

params <- splatEstimate(counts)

在這里,我們從計(jì)數(shù)矩陣估計(jì)參數(shù)地粪,但是splatEstimate也可以接受SingleCellExperiment對(duì)象取募。估算過(guò)程包括以下步驟:

  • 1.通過(guò)將伽馬分布擬合到平均表達(dá)式水平來(lái)估計(jì)均值參數(shù)。
  • 2.通過(guò)對(duì)庫(kù)大小擬合對(duì)數(shù)正態(tài)分布來(lái)估計(jì)庫(kù)大小參數(shù)蟆技。
  • 3.通過(guò)確定離群點(diǎn)的數(shù)量并將對(duì)數(shù)正態(tài)分布擬合到它們與中位數(shù)的差值來(lái)估計(jì)表達(dá)矩陣中離群點(diǎn)參數(shù)玩敏。
  • 4.使用edgeR包中的estimateDisp函數(shù)估計(jì)BCV參數(shù)。
  • 5.通過(guò)檢查是否存在dropout并將logistic函數(shù)擬合到均值表達(dá)式與零點(diǎn)比例之間的關(guān)系來(lái)估計(jì)丟失參數(shù)质礼。

有關(guān)評(píng)估程序的更多詳細(xì)信息旺聚,請(qǐng)參閱?splatEstimate

6. Simulating counts

一旦我們有了一組滿意的參數(shù)眶蕉,我們就可以使用splatSimulate來(lái)模擬計(jì)數(shù)砰粹。如果我們想要對(duì)參數(shù)進(jìn)行小幅調(diào)整,我們可以將它們作為附加參數(shù)提供造挽,或者碱璃,如果我們不提供任何參數(shù)弄痹,則將使用默認(rèn)值:

sim <- splatSimulate(params, nGenes = 1000)
sim
#> class: SingleCellExperiment 
#> dim: 1000 200 
#> metadata(1): Params
#> assays(6): BatchCellMeans BaseCellMeans ... TrueCounts counts
#> rownames(1000): Gene1 Gene2 ... Gene999 Gene1000
#> rowData names(4): Gene BaseGeneMean OutlierFactor GeneMean
#> colnames(200): Cell1 Cell2 ... Cell199 Cell200
#> colData names(3): Cell Batch ExpLibSize
#> reducedDimNames(0):
#> spikeNames(0):
#> altExpNames(0):

查看splatSimulate的輸出可以看出,sim是具有1000個(gè)特征(基因)和200個(gè)樣本(細(xì)胞)的SingleCellExperient對(duì)象嵌器。此對(duì)象的主要部分是包含模擬計(jì)數(shù)(使用counts訪問(wèn))的按樣本特征矩陣界酒,盡管它也可以包含其他表達(dá)式(expression measures)度量,如FPKM或TPM嘴秸。此外毁欣,SingleCellExperient包含每個(gè)細(xì)胞的表型信息(使用colData訪問(wèn))和每個(gè)基因的特征信息(使用rowData訪問(wèn))。Splatter使用這些slots以及assays來(lái)存儲(chǔ)有關(guān)模擬中間值的信息岳掐。

# Access the counts
counts(sim)[1:5, 1:5]
#>       Cell1 Cell2 Cell3 Cell4 Cell5
#> Gene1   142   390   110   188   521
#> Gene2  1473   431  1214   881   401
#> Gene3   753   667   940   379   469
#> Gene4  2136  2003  1557  1223   700
#> Gene5   887   529   139   213  1027
# Information about genes
head(rowData(sim))
#> DataFrame with 6 rows and 4 columns
#>           Gene     BaseGeneMean OutlierFactor         GeneMean
#>       <factor>        <numeric>     <numeric>        <numeric>
#> Gene1    Gene1 149.828982351488             1 149.828982351488
#> Gene2    Gene2 316.265527453104             1 316.265527453104
#> Gene3    Gene3 208.039144740334             1 208.039144740334
#> Gene4    Gene4 713.244884835822             1 713.244884835822
#> Gene5    Gene5 171.193128303906             1 171.193128303906
#> Gene6    Gene6 147.896699333501             1 147.896699333501
# Information about cells
head(colData(sim))
#> DataFrame with 6 rows and 3 columns
#>           Cell       Batch       ExpLibSize
#>       <factor> <character>        <numeric>
#> Cell1    Cell1      Batch1 382540.907555357
#> Cell2    Cell2      Batch1  359403.82489391
#> Cell3    Cell3      Batch1 361406.811639421
#> Cell4    Cell4      Batch1 351539.053237628
#> Cell5    Cell5      Batch1 368536.896292943
#> Cell6    Cell6      Batch1 360324.111883911
# Gene by cell matrices
names(assays(sim))
#> [1] "BatchCellMeans" "BaseCellMeans"  "BCV"            "CellMeans"     
#> [5] "TrueCounts"     "counts"
# Example of cell means matrix
assays(sim)$CellMeans[1:5, 1:5]
#>           Cell1     Cell2     Cell3     Cell4    Cell5
#> Gene1  141.2097  375.1557  102.9034  195.2901 510.1044
#> Gene2 1494.8964  421.8598 1181.7091  844.7958 420.4107
#> Gene3  739.8169  693.3015  986.7745  366.4702 467.8734
#> Gene4 2192.9068 2015.6644 1453.3987 1231.2689 735.8236
#> Gene5  900.2559  573.5933  146.7801  194.2355 983.0344

輸出SingleCellExperient的另一個(gè)(很大)好處是我們可以立即訪問(wèn)其他分析包凭疮,比如scater中的繪圖函數(shù)。例如串述,我們可以制作PCA圖:

# Use scater to calculate logcounts
sim <- logNormCounts(sim)
# Plot PCA
sim <- runPCA(sim)
plotPCA(sim)

注意:您的值和繪圖可能看起來(lái)不同执解,因?yàn)槟M是隨機(jī)的,每次運(yùn)行時(shí)都會(huì)產(chǎn)生不同的結(jié)果纲酗。

有關(guān)SingleCellExperient對(duì)象的更多詳細(xì)信息衰腌,請(qǐng)參閱[vignette] SCE-vignette。有關(guān)您可以使用scater執(zhí)行哪些操作的信息觅赊,請(qǐng)參閱相關(guān)的scater文檔和網(wǎng)站vignette右蕊。
splatSimulate函數(shù)輸出以下有關(guān)模擬的附加信息:

  • Cell information (colData)

    • Cell - Unique cell identifier.
    • Group - The group or path the cell belongs to.
    • ExpLibSize - The expected library size for that cell.
    • Step (paths only) - How far along the path each cell is.
  • Gene information (rowData)

    • Gene - Unique gene identifier.
    • BaseGeneMean - The base expression level for that gene.
    • OutlierFactor - Expression outlier factor for that gene (1 is not an outlier).
    • GeneMean - Expression level after applying outlier factors.
    • DEFac[Group] - The differential expression factor for each gene in a particular group (1 is not differentially expressed).
    • GeneMean[Group] - Expression level of a gene in a particular group after applying differential expression factors.
  • Gene by cell information (assays)

    • BaseCellMeans - The expression of genes in each cell adjusted for expected library size.
    • BCV - The Biological Coefficient of Variation for each gene in each cell.
    • CellMeans - The expression level of genes in each cell adjusted for BCV.
    • TrueCounts - The simulated counts before dropout.
    • Dropout - Logical matrix showing which counts have been dropped in which cells.

Splatter添加的值使用UpperCamelCase命名,將其與scater等包使用的underscore_naming分開(kāi)吮螺。有關(guān)模擬的更多信息饶囚,請(qǐng)參見(jiàn)?splatSimulate

6.1 Simulating groups

到目前為止鸠补,我們只模擬了單個(gè)細(xì)胞群體萝风,但我們通常感興趣的是研究混合細(xì)胞群體,看看存在什么類型的細(xì)胞紫岩,或者它們之間有什么不同规惰。Splatter可以通過(guò)更改method參數(shù)來(lái)模擬這些情況,這里我們將通過(guò)指定group.pro參數(shù)并將method參數(shù)設(shè)置為"groups"來(lái)模擬兩個(gè)組:

注:我們還將Verbose參數(shù)設(shè)置為FALSE泉蝌,以停止Splatter打印進(jìn)度消息歇万。

sim.groups <- splatSimulate(group.prob = c(0.5, 0.5), method = "groups",
                            verbose = FALSE)
sim.groups <- logNormCounts(sim.groups)
sim.groups <- runPCA(sim.groups)
#> Warning in (function (A, nv = 5, nu = nv, maxit = 1000, work = nv + 7, reorth =
#> TRUE, : You're computing too large a percentage of total singular values, use a
#> standard svd instead.
plotPCA(sim.groups, colour_by = "Group")
image.png

因?yàn)槲覀円呀?jīng)將兩個(gè)組的概率都設(shè)置為0.5,所以我們應(yīng)該在每個(gè)組中獲得大致相等的細(xì)胞數(shù)量(在本例中大約是50個(gè))梨与。如果我們想要不均勻的組堕花,我們可以將group.prob設(shè)置為和為1的任何概率集。

6.2 Simulating paths

另一種通常令人感興趣的情況是分化過(guò)程粥鞋,其中一種細(xì)胞類型正在轉(zhuǎn)變?yōu)榱硪环N細(xì)胞類型缘挽。Splatter通過(guò)模擬兩組之間的一系列步驟并隨機(jī)將每個(gè)細(xì)胞分配到一個(gè)步驟來(lái)近似這一過(guò)程。我們可以使用"paths"方法創(chuàng)建這種模擬。

sim.paths <- splatSimulate(de.prob = 0.2, nGenes = 1000, method = "paths",
                           verbose = FALSE)
sim.paths <- logNormCounts(sim.paths)
sim.paths <- runPCA(sim.paths)
#> Warning in (function (A, nv = 5, nu = nv, maxit = 1000, work = nv + 7, reorth =
#> TRUE, : You're computing too large a percentage of total singular values, use a
#> standard svd instead.
plotPCA(sim.paths, colour_by = "Step")
image.png

這里的顏色代表了每個(gè)細(xì)胞的“step”壕曼,或者它沿著分化路徑走了多遠(yuǎn)苏研。我們可以看到,深色的細(xì)胞與原始細(xì)胞類型更相似腮郊,淺色的細(xì)胞更接近最終的分化細(xì)胞類型摹蘑。通過(guò)設(shè)置額外的參數(shù),可以模擬更復(fù)雜的過(guò)程(例如轧飞,來(lái)自單個(gè)祖細(xì)胞的多種成熟細(xì)胞類型)衅鹿。

6.3 Batch effects

在任何測(cè)序?qū)嶒?yàn)的分析中,另一個(gè)重要的因素是批次效應(yīng)过咬,這是一組同時(shí)處理的樣本共同存在的技術(shù)差異大渤。我們通過(guò)告訴Splatter每批中有多少個(gè)細(xì)胞來(lái)應(yīng)用批處理效果:

sim.batches <- splatSimulate(batchCells = c(50, 50), verbose = FALSE)
sim.batches <- logNormCounts(sim.batches)
sim.batches <- runPCA(sim.batches)
#> Warning in (function (A, nv = 5, nu = nv, maxit = 1000, work = nv + 7, reorth =
#> TRUE, : You're computing too large a percentage of total singular values, use a
#> standard svd instead.
plotPCA(sim.batches, colour_by = "Batch")
image.png

這看起來(lái)很像我們模擬groups的時(shí)候,那是因?yàn)檫^(guò)程非常相似掸绞。不同之處在于泵三,批量效應(yīng)適用于所有基因,而不僅僅是那些差異表達(dá)的基因衔掸,而且影響通常較小烫幕。通過(guò)組合分組和批次,我們既可以模擬我們不感興趣的不需要的變化(batch)敞映,也可以模擬我們正在尋找的需要的變化(group):

sim.groups <- splatSimulate(batchCells = c(50, 50), group.prob = c(0.5, 0.5),
                            method = "groups", verbose = FALSE)
sim.groups <- logNormCounts(sim.groups)
sim.groups <- runPCA(sim.groups)
#> Warning in (function (A, nv = 5, nu = nv, maxit = 1000, work = nv + 7, reorth =
#> TRUE, : You're computing too large a percentage of total singular values, use a
#> standard svd instead.
plotPCA(sim.groups, shape_by = "Batch", colour_by = "Group")
image.png

在這里较曼,我們看到組(first component)的影響比批處理效果(second component)強(qiáng),但是通過(guò)調(diào)整參數(shù)驱显,我們可以使批處理效果占主導(dǎo)地位诗芜。

6.4 Convenience functions

每種Splatter模擬方法都有各自的便捷功能。模擬單個(gè)種群使用splatSimulateSingle()(相當(dāng)于splatSimulate(method = "single"))埃疫,使用splatSimulateGroups()(相當(dāng)于splatSimulate(method = "groups"))或模擬路徑使用splatSimulatePaths()(相當(dāng)于splatSimulate(method = "paths"))。

7. Other simulations

除了它自己的Splat模擬方法孩哑,Splatter包還包含其他已發(fā)表的單細(xì)胞RNA-seq模擬的實(shí)現(xiàn)栓霜,或者包裝其他包中包含的模擬。要查看所有可用的模擬横蜒,請(qǐng)運(yùn)行listSims()函數(shù):

listSims()
#> Splatter currently contains 14 simulations 
#> 
#> Splat (splat) 
#> DOI: 10.1186/s13059-017-1305-0    GitHub: Oshlack/splatter 
#> The Splat simulation generates means from a gamma distribution, adjusts them for BCV and generates counts from a gamma-poisson. Dropout and batch effects can be optionally added. 
#> 
#> Splat Single (splatSingle) 
#> DOI: 10.1186/s13059-017-1305-0    GitHub: Oshlack/splatter 
#> The Splat simulation with a single population. 
#> 
#> Splat Groups (splatGroups) 
#> DOI: 10.1186/s13059-017-1305-0    GitHub: Oshlack/splatter 
#> The Splat simulation with multiple groups. Each group can have it's own differential expression probability and fold change distribution. 
#> 
#> Splat Paths (splatPaths) 
#> DOI: 10.1186/s13059-017-1305-0    GitHub: Oshlack/splatter 
#> The Splat simulation with differentiation paths. Each path can have it's own length, skew and probability. Genes can change in non-linear ways. 
#> 
#> Kersplat (kersplat) 
#> DOI:      GitHub: Oshlack/splatter 
#> The Kersplat simulation extends the Splat model by adding a gene network, more complex cell structure, doublets and empty cells (Experimental). 
#> 
#> Simple (simple) 
#> DOI: 10.1186/s13059-017-1305-0    GitHub: Oshlack/splatter 
#> A simple simulation with gamma means and negative binomial counts. 
#> 
#> Lun (lun) 
#> DOI: 10.1186/s13059-016-0947-7    GitHub: MarioniLab/Deconvolution2016 
#> Gamma distributed means and negative binomial counts. Cells are given a size factor and differential expression can be simulated with fixed fold changes. 
#> 
#> Lun 2 (lun2) 
#> DOI: 10.1093/biostatistics/kxw055     GitHub: MarioniLab/PlateEffects2016 
#> Negative binomial counts where the means and dispersions have been sampled from a real dataset. The core feature of the Lun 2 simulation is the addition of plate effects. Differential expression can be added between two groups of plates and optionally a zero-inflated negative-binomial can be used. 
#> 
#> scDD (scDD) 
#> DOI: 10.1186/s13059-016-1077-y    GitHub: kdkorthauer/scDD 
#> The scDD simulation samples a given dataset and can simulate differentially expressed and differentially distributed genes between two conditions. 
#> 
#> BASiCS (BASiCS) 
#> DOI: 10.1371/journal.pcbi.1004333     GitHub: catavallejos/BASiCS 
#> The BASiCS simulation is based on a bayesian model used to deconvolve biological and technical variation and includes spike-ins and batch effects. 
#> 
#> mfa (mfa) 
#> DOI: 10.12688/wellcomeopenres.11087.1     GitHub: kieranrcampbell/mfa 
#> The mfa simulation produces a bifurcating pseudotime trajectory. This can optionally include genes with transient changes in expression and added dropout. 
#> 
#> PhenoPath (pheno) 
#> DOI: 10.1101/159913   GitHub: kieranrcampbell/phenopath 
#> The PhenoPath simulation produces a pseudotime trajectory with different types of genes. 
#> 
#> ZINB-WaVE (zinb) 
#> DOI: 10.1101/125112   GitHub: drisso/zinbwave 
#> The ZINB-WaVE simulation simulates counts from a sophisticated zero-inflated negative-binomial distribution including cell and gene-level covariates. 
#> 
#> SparseDC (sparseDC) 
#> DOI: 10.1093/nar/gkx1113      GitHub: cran/SparseDC 
#> The SparseDC simulation simulates a set of clusters across two conditions, where some clusters may be present in only one condition.

每模種擬都有其自己的前綴胳蛮,該前綴提供了與該模擬相關(guān)聯(lián)的函數(shù)的名稱。例如丛晌,簡(jiǎn)單模擬的前綴是simple仅炊,因此它會(huì)將其參數(shù)存儲(chǔ)在一個(gè)SimpleParams對(duì)象中,該對(duì)象可以使用newSimpleParams()創(chuàng)建澎蛛,也可以使用simpleEstimate()從實(shí)際數(shù)據(jù)中估計(jì)抚垄。要使用該模擬來(lái)模擬數(shù)據(jù),您可以使用simpleSimulate()。每次模擬返回一個(gè)SingleCellExperient對(duì)象呆馁,其中間值與splatSimulate()返回的值類似桐经。有關(guān)每個(gè)模擬的更多詳細(xì)信息,請(qǐng)參閱相應(yīng)的幫助頁(yè)面(例如浙滤,?simpleSimulate了解簡(jiǎn)單模擬的工作原理阴挣,或?lun2Estimate了解Lun 2模擬如何估計(jì)參數(shù)的詳細(xì)信息)或參考相應(yīng)的論文或包。

8. Other expression values

Splatter旨在模擬計(jì)數(shù)數(shù)據(jù)纺腊,但一些分析方法需要其他表達(dá)式值畔咧,特別是長(zhǎng)度歸一化的值,如TPM或FPKM揖膜。scater包具有將這些值添加到一個(gè)SingleCellExperiment對(duì)象中的函數(shù)盒卸,但是它們需要每個(gè)基因的長(zhǎng)度。addGeneLengths函數(shù)可以用來(lái)模擬這些長(zhǎng)度:

sim <- simpleSimulate(verbose = FALSE)
sim <- addGeneLengths(sim)
head(rowData(sim))
#> DataFrame with 6 rows and 3 columns
#>           Gene          GeneMean    Length
#>       <factor>         <numeric> <numeric>
#> Gene1    Gene1  10.5022855005221      2520
#> Gene2    Gene2  1.15678349696842       729
#> Gene3    Gene3  1.52303353341067      6009
#> Gene4    Gene4 0.249127173451236     14184
#> Gene5    Gene5   1.3196168428461      3609
#> Gene6    Gene6  1.11978689263638      2468

然后次氨,我們可以使用scater計(jì)算TPM

tpm(sim) <- calculateTPM(sim, rowData(sim)$Length)
tpm(sim)[1:5, 1:5]
#>          Cell1     Cell2     Cell3    Cell4     Cell5
#> Gene1 556.8385 954.67814 251.01798 313.8751 771.18856
#> Gene2 213.8749 220.00813 216.92912   0.0000 222.15308
#> Gene3   0.0000  53.38190  26.31741   0.0000   0.00000
#> Gene4   0.0000   0.00000   0.00000   0.0000   0.00000
#> Gene5 172.8066  44.44055 131.45580   0.0000  44.87381

addGeneLengths模擬長(zhǎng)度的默認(rèn)方法是從對(duì)數(shù)正態(tài)分布生成值蔽介,然后四舍五入為整數(shù)長(zhǎng)度。這種分布的參數(shù)是基于人類蛋白質(zhì)編碼基因的煮寡,但如果需要可以調(diào)整(例如虹蓄,對(duì)于其他物種)⌒宜海或者薇组,可以從提供的向量中采樣長(zhǎng)度(有關(guān)詳細(xì)信息和示例,請(qǐng)參閱?addGeneLengths)坐儿。

9. Comparing simulations and real data

在模擬數(shù)據(jù)之后律胀,您可能想要做的一件事是將其與真實(shí)數(shù)據(jù)集進(jìn)行比較,或者比較具有不同參數(shù)或模型的模擬貌矿。Splatter提供了一個(gè)函數(shù)compareSCEs炭菌,旨在簡(jiǎn)化這些比較。顧名思義逛漫,這個(gè)函數(shù)獲取一個(gè)SingleCellExperient對(duì)象列表黑低,組合數(shù)據(jù)集,并生成一些比較它們的曲線圖酌毡。讓我們做兩個(gè)小的模擬克握,看看它們?nèi)绾伪容^。

sim1 <- splatSimulate(nGenes = 1000, batchCells = 20, verbose = FALSE)
sim2 <- simpleSimulate(nGenes = 1000, nCells = 20, verbose = FALSE)
comparison <- compareSCEs(list(Splat = sim1, Simple = sim2))

names(comparison)
#> [1] "RowData" "ColData" "Plots"
names(comparison$Plots)
#> [1] "Means"        "Variances"    "MeanVar"      "LibrarySizes" "ZerosGene"   
#> [6] "ZerosCell"    "MeanZeros"    "VarGeneCor"

返回的列表有三個(gè)項(xiàng)目枷踏。前兩個(gè)是按基因(RowData)和按細(xì)胞(ColData)組合的數(shù)據(jù)集菩暗,第三個(gè)包含一些比較圖(使用ggplot2生成),例如均值分布圖:

comparison$Plots$Means
image.png

這些只是您可能想要考慮的幾個(gè)曲線圖旭蠕,但是使用返回的數(shù)據(jù)應(yīng)該很容易做出更多的曲線圖停团。例如旷坦,我們可以繪制表達(dá)基因的數(shù)量與文庫(kù)大小的關(guān)系圖:

library("ggplot2")
ggplot(comparison$ColData, aes(x = sum, y = detected, colour = Dataset)) +
    geom_point()
image.png

9.1 Comparing differences

有時(shí),查看它們之間的差異可能更有趣客蹋,而不是直觀地比較數(shù)據(jù)集塞蹭。我們可以使用diffSCEs函數(shù)來(lái)實(shí)現(xiàn)這一點(diǎn)。與compareSCEs類似讶坯,該函數(shù)接受SingleCellExperient對(duì)象的列表番电,但是現(xiàn)在我們還指定了一個(gè)作為引用。返回了一系列類似的曲線圖辆琅,但它們并沒(méi)有顯示總體分布漱办,而是顯示了與參考數(shù)據(jù)的不同之處。

difference <- diffSCEs(list(Splat = sim1, Simple = sim2), ref = "Simple")
difference$Plots$Means
image.png

我們還得到了一系列可以用來(lái)比較分布的分位數(shù)-分位數(shù)(Q-Q)曲線圖婉烟。

difference$QQPlots$Means
image.png

9.2 Making panels

這些比較中的每一個(gè)都有幾個(gè)圖娩井,可能會(huì)有很多值得關(guān)注的地方。為了簡(jiǎn)化操作似袁,或者為出版物制作圖形嗅辣,可以使用函數(shù)makeCompPanel躏仇、makeDiffPanelmakeOverallPanel伪阶。 這些函數(shù)使用cowplot包將繪圖合并到單個(gè)panel中歼捏。panel可能非常大,很難查看(例如而涉,在RStudio的繪圖查看器中)著瓶,因此最好輸出panel并分別查看它們。幸運(yùn)的是啼县,cowplot提供了一個(gè)方便的保存圖像的功能材原。以下是輸出每個(gè)panel的一些建議參數(shù):

#  # This code is just an example and is not run
#  panel <- makeCompPanel(comparison)
#  cowplot::save_plot("comp_panel.png", panel, nrow = 4, ncol = 3)
#  
#  panel <- makeDiffPanel(difference)
#  cowplot::save_plot("diff_panel.png", panel, nrow = 3, ncol = 5)
#  
#  panel <- makeOverallPanel(comparison, difference)
#  cowplot::save_plot("overall_panel.png", panel, ncol = 4, nrow = 7)

10. Citing Splatter

如果您在工作中使用 Splatter,請(qǐng)引用這篇的論文:

citation("splatter")
#> 
#>   Zappia L, Phipson B, Oshlack A. Splatter: Simulation of single-cell
#>   RNA sequencing data. Genome Biology. 2017;
#>   doi:10.1186/s13059-017-1305-0
#> 
#> A BibTeX entry for LaTeX users is
#> 
#>   @Article{,
#>     author = {Luke Zappia and Belinda Phipson and Alicia Oshlack},
#>     title = {Splatter: simulation of single-cell RNA sequencing data},
#>     journal = {Genome Biology},
#>     year = {2017},
#>     url = {http://dx.doi.org/10.1186/s13059-017-1305-0},
#>     doi = {10.1186/s13059-017-1305-0},
#>   }

Session information

sessionInfo()
#> R version 3.6.1 (2019-07-05)
#> Platform: x86_64-apple-darwin15.6.0 (64-bit)
#> Running under: macOS High Sierra 10.13.6
#> 
#> Matrix products: default
#> BLAS:   /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
#> 
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#> 
#> attached base packages:
#> [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
#> [8] methods   base     
#> 
#> other attached packages:
#>  [1] scater_1.14.1               ggplot2_3.3.1              
#>  [3] splatter_1.10.1             SingleCellExperiment_1.8.0 
#>  [5] SummarizedExperiment_1.16.0 DelayedArray_0.12.0        
#>  [7] BiocParallel_1.20.0         matrixStats_0.56.0         
#>  [9] Biobase_2.46.0              GenomicRanges_1.38.0       
#> [11] GenomeInfoDb_1.22.0         IRanges_2.20.1             
#> [13] S4Vectors_0.24.0            BiocGenerics_0.32.0        
#> 
#> loaded via a namespace (and not attached):
#>  [1] viridis_0.5.1            edgeR_3.28.0             BiocSingular_1.2.0      
#>  [4] viridisLite_0.3.0        splines_3.6.1            DelayedMatrixStats_1.8.0
#>  [7] sp_1.4-2                 GenomeInfoDbData_1.2.2   vipor_0.4.5             
#> [10] yaml_2.2.1               pillar_1.4.4             backports_1.1.7         
#> [13] lattice_0.20-41          glue_1.4.1               limma_3.42.2            
#> [16] digest_0.6.25            RColorBrewer_1.1-2       XVector_0.26.0          
#> [19] checkmate_2.0.0          colorspace_1.4-1         cowplot_1.0.0           
#> [22] htmltools_0.4.0          Matrix_1.2-18            pkgconfig_2.0.3         
#> [25] zlibbioc_1.32.0          purrr_0.3.4              scales_1.1.1            
#> [28] tibble_3.0.1             generics_0.0.2           farver_2.0.3            
#> [31] ellipsis_0.3.1           withr_2.2.0              survival_3.1-12         
#> [34] magrittr_1.5             crayon_1.3.4             evaluate_0.14           
#> [37] MASS_7.3-51.6            beeswarm_0.2.3           tools_3.6.1             
#> [40] fitdistrplus_1.1-1       lifecycle_0.2.0          stringr_1.4.0           
#> [43] munsell_0.5.0            locfit_1.5-9.4           irlba_2.3.3             
#> [46] akima_0.6-2.1            compiler_3.6.1           rsvd_1.0.3              
#> [49] rlang_0.4.6              grid_3.6.1               RCurl_1.98-1.2          
#> [52] BiocNeighbors_1.4.1      bitops_1.0-6             labeling_0.3            
#> [55] rmarkdown_2.2            gtable_0.3.0             R6_2.4.1                
#> [58] gridExtra_2.3            knitr_1.28               dplyr_1.0.0             
#> [61] stringi_1.4.6            ggbeeswarm_0.6.0         Rcpp_1.0.4              
#> [64] vctrs_0.3.1              tidyselect_1.1.0         xfun_0.14
最后編輯于
?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請(qǐng)聯(lián)系作者
  • 序言:七十年代末,一起剝皮案震驚了整個(gè)濱河市瘟裸,隨后出現(xiàn)的幾起案子,更是在濱河造成了極大的恐慌话告,老刑警劉巖,帶你破解...
    沈念sama閱讀 216,372評(píng)論 6 498
  • 序言:濱河連續(xù)發(fā)生了三起死亡事件沙郭,死亡現(xiàn)場(chǎng)離奇詭異裳朋,居然都是意外死亡病线,警方通過(guò)查閱死者的電腦和手機(jī),發(fā)現(xiàn)死者居然都...
    沈念sama閱讀 92,368評(píng)論 3 392
  • 文/潘曉璐 我一進(jìn)店門,熙熙樓的掌柜王于貴愁眉苦臉地迎上來(lái)送挑,“玉大人,你說(shuō)我怎么就攤上這事纺裁∷九欤” “怎么了?”我有些...
    開(kāi)封第一講書人閱讀 162,415評(píng)論 0 353
  • 文/不壞的土叔 我叫張陵谚殊,是天一觀的道長(zhǎng)蛤铜。 經(jīng)常有香客問(wèn)我,道長(zhǎng)剿干,這世上最難降的妖魔是什么虐先? 我笑而不...
    開(kāi)封第一講書人閱讀 58,157評(píng)論 1 292
  • 正文 為了忘掉前任蛹批,我火速辦了婚禮,結(jié)果婚禮上腐芍,老公的妹妹穿的比我還像新娘猪勇。我一直安慰自己,他們只是感情好泣刹,可當(dāng)我...
    茶點(diǎn)故事閱讀 67,171評(píng)論 6 388
  • 文/花漫 我一把揭開(kāi)白布椅您。 她就那樣靜靜地躺著,像睡著了一般雪隧。 火紅的嫁衣襯著肌膚如雪。 梳的紋絲不亂的頭發(fā)上藕畔,一...
    開(kāi)封第一講書人閱讀 51,125評(píng)論 1 297
  • 那天庄拇,我揣著相機(jī)與錄音丛忆,去河邊找鬼。 笑死熄诡,一個(gè)胖子當(dāng)著我的面吹牛凰浮,可吹牛的內(nèi)容都是我干的。 我是一名探鬼主播菜拓,決...
    沈念sama閱讀 40,028評(píng)論 3 417
  • 文/蒼蘭香墨 我猛地睜開(kāi)眼笛厦,長(zhǎng)吁一口氣:“原來(lái)是場(chǎng)噩夢(mèng)啊……” “哼!你這毒婦竟也來(lái)了贱鄙?” 一聲冷哼從身側(cè)響起姨谷,我...
    開(kāi)封第一講書人閱讀 38,887評(píng)論 0 274
  • 序言:老撾萬(wàn)榮一對(duì)情侶失蹤梦湘,失蹤者是張志新(化名)和其女友劉穎,沒(méi)想到半個(gè)月后哼拔,有當(dāng)?shù)厝嗽跇?shù)林里發(fā)現(xiàn)了一具尸體瓣颅,經(jīng)...
    沈念sama閱讀 45,310評(píng)論 1 310
  • 正文 獨(dú)居荒郊野嶺守林人離奇死亡弄捕,尸身上長(zhǎng)有42處帶血的膿包…… 初始之章·張勛 以下內(nèi)容為張勛視角 年9月15日...
    茶點(diǎn)故事閱讀 37,533評(píng)論 2 332
  • 正文 我和宋清朗相戀三年,在試婚紗的時(shí)候發(fā)現(xiàn)自己被綠了穿铆。 大學(xué)時(shí)的朋友給我發(fā)了我未婚夫和他白月光在一起吃飯的照片斋荞。...
    茶點(diǎn)故事閱讀 39,690評(píng)論 1 348
  • 序言:一個(gè)原本活蹦亂跳的男人離奇死亡平酿,死狀恐怖,靈堂內(nèi)的尸體忽然破棺而出筑辨,到底是詐尸還是另有隱情幸逆,我是刑警寧澤,帶...
    沈念sama閱讀 35,411評(píng)論 5 343
  • 正文 年R本政府宣布楚昭,位于F島的核電站拍顷,受9級(jí)特大地震影響,放射性物質(zhì)發(fā)生泄漏尿贫。R本人自食惡果不足惜爱沟,卻給世界環(huán)境...
    茶點(diǎn)故事閱讀 41,004評(píng)論 3 325
  • 文/蒙蒙 一呼伸、第九天 我趴在偏房一處隱蔽的房頂上張望。 院中可真熱鬧搂根,春花似錦铃辖、人聲如沸。這莊子的主人今日做“春日...
    開(kāi)封第一講書人閱讀 31,659評(píng)論 0 22
  • 文/蒼蘭香墨 我抬頭看了看天上的太陽(yáng)芒帕。三九已至丰介,卻和暖如春,著一層夾襖步出監(jiān)牢的瞬間带膀,已是汗流浹背橙垢。 一陣腳步聲響...
    開(kāi)封第一講書人閱讀 32,812評(píng)論 1 268
  • 我被黑心中介騙來(lái)泰國(guó)打工钢悲, 沒(méi)想到剛下飛機(jī)就差點(diǎn)兒被人妖公主榨干…… 1. 我叫王不留,地道東北人莺琳。 一個(gè)月前我還...
    沈念sama閱讀 47,693評(píng)論 2 368
  • 正文 我出身青樓惭等,卻偏偏與公主長(zhǎng)得像,于是被迫代替她去往敵國(guó)和親琳要。 傳聞我的和親對(duì)象是個(gè)殘疾皇子秤茅,可洞房花燭夜當(dāng)晚...
    茶點(diǎn)故事閱讀 44,577評(píng)論 2 353