系統(tǒng)學(xué)習(xí)單細(xì)胞轉(zhuǎn)錄組測序scRNA-Seq(四)

劉小澤寫于19.3.26
陌生的領(lǐng)域就像一張白紙呕臂，怎么畫就看自己了

前言

關(guān)于單細(xì)胞測序破托，我們已經(jīng)知道了它和bulk RNA-seq的區(qū)別，簡單說就是scRNA體現(xiàn)異質(zhì)性（個體）歧蒋，bulk體現(xiàn)平均程度(總體)土砂，因此利用bulk RNA不能區(qū)分一個樣本中的不同細(xì)胞類型

scRNA&bulk RNAseq

做scRNA的最大的好處就是可以"對癥下藥"(當(dāng)然這也是最理想的情況=》什么細(xì)胞配合什么種類或者劑量的藥物)

對癥下藥(圖片來自：https://stm.sciencemag.org/content/9/408/eaan4730.full)

Biomarker：A biological molecule found in blood, other body fluids, or tissues that is a sign of a normal or abnormal process, or of a condition or disease. A biomarker may be used to see how well the body responds to a treatment for a disease or condition. Also called molecular marker and signature molecule.

一般在下機(jī)后會得到一個表達(dá)矩陣，行為基因谜洽，列為細(xì)胞樣本(根據(jù)測序平臺不同細(xì)胞數(shù)不同萝映，目前商業(yè)化平臺有10X和BD，一般都能達(dá)到2000-8000細(xì)胞量阐虚，可以覆蓋絕大部分組織的single cell群體類型)序臂。
還包括兩個協(xié)變量(covariates：解釋變量，不為實驗者所操縱实束，但仍影響實驗結(jié)果)：gene-level(比如GC含量)和cell-level(比如細(xì)胞測序的批次信息)奥秆。

單細(xì)胞和bulk轉(zhuǎn)錄組表達(dá)矩陣主要有兩點不同：

bulk的表達(dá)矩陣是reads比對到基因的數(shù)目；單細(xì)胞的表達(dá)矩陣是reads比對到某個細(xì)胞的某個基因的數(shù)目(barcode用來區(qū)分細(xì)胞咸灿；UMI用來區(qū)分轉(zhuǎn)錄本/基因)
單細(xì)胞存在許多的0构订，有兩個可能：第一種是真的基因在細(xì)胞中不表達(dá)(參與細(xì)胞分化的基因并不是在每個cell cycle都表達(dá))；第二種是技術(shù)誤差析显，基因?qū)嶋H表達(dá)但測序沒有測到(也就是"dropouts")

從頭開始慢慢理解

這一部分暫時不討論細(xì)節(jié)鲫咽，先看Big Picture

探索表達(dá)矩陣

The matrix of counts is the number of sequenced reads aligned to each gene and each sample

假設(shè)這個矩陣有10個基因(Lamp5, Fam19a1, Cnr1, Rorb, Sparcl1, Crym, Ddah1, Lmo3, Serpine2, Pde1a) 和5個細(xì)胞樣本(SRR2140028, SRR2140022, SRR2140055, SRR2140083, SRR2139991)

表達(dá)矩陣如下：

> counts
         SRR2140028 SRR2140022 SRR2140055 SRR2140083 SRR2139991
Lamp5            10         11          0          0          8
Fam19a1          11          9          0          6          0
Cnr1              0          0          0         12          0
Rorb              0          8          0          0          0
Sparcl1           0          7          0         13          0
Crym              8          9         10         12          5
Ddah1            16          0          0          8          0
Lmo3              0          0          8         13          0
Serpine2          8          0          0          0          0
Pde1a             0         14          8         11          0

# head of count matrix
counts[1:3, 1:3]

# count of specific gene and cell
reads <- counts["Cnr1", "SRR2140055"]

# overall proportion of zero counts 
pZero <- mean(counts==0)

# cell library size
libsize <- colSums(counts)

The cell coverage is the average number of expressed genes that are greater than zero in each cell.

目的：ggplot2畫出cell coverage

假設(shè)目前有一個關(guān)于細(xì)胞樣本的協(xié)變量(cell-level)，內(nèi)容如下：

> cell_info
                names batch patient 
SRR2140028 SRR2140028     1       a      
SRR2140022 SRR2140022     1       a     
SRR2140055 SRR2140055     2       b     
SRR2140083 SRR2140083     2       c     
SRR2139991 SRR2139991     2       c

先計算coverage谷异，然后加在cell_info的最后

# find cell coverage 計算count大于0的平均值
# 先將表達(dá)量為0的值賦值一個NA分尸，然后計算時忽略掉它們
is.na(counts)=counts==0
coverage <- colMeans(counts,na.rm=T)

cell_info$coverage <- coverage
library(ggplot2)
ggplot(cell_info, aes(x = names, y = coverage)) + 
  geom_point() +
  ggtitle('Cell Coverage') + 
  xlab('Cell Name') + 
  ylab('Coverage')

看看大體的流程

測序=》single cell 發(fā)展歷史

圖片來自：Exponential scaling of single-cell RNA-seq in the past decade. (2018 Nature Protocols)

2009-2017不到十年，已報道的細(xì)胞研究數(shù)目就從1個飛躍至一百萬個

scRNA發(fā)展

盡管這些方法各不相同歹嘹，但原理的主要不同之處有兩個：分子定量(Quantification)和捕獲方法(Capture)

RNA分子定量（Quantification）：主要是兩種技術(shù)(full-length和tag-based)
The full-length：基于全長箩绍，試圖獲取每個轉(zhuǎn)錄本覆蓋度一致且獨特的全長片段；
The tag-based：基于標(biāo)簽探針尺上，捕獲RNA的5'或3'端序列
捕獲方法（Capture）：不僅決定了細(xì)胞的篩選方式以及實驗的通量材蛛，同時也間接反映了測序外的其他信息圆到。目前主要有三種捕獲方法：
微孔 microwell
微流 microfluidic
微滴 droplet

詳細(xì)看這里：https://hemberg-lab.github.io/scRNA.seq.course/introduction-to-single-cell-rna-seq.html

目前兩大商業(yè)化單細(xì)胞測序公司：

10X Genomics Chromium
利用微流控、油滴包裹和barcode標(biāo)記實現(xiàn)高通量細(xì)胞捕獲和標(biāo)記卑吭，一次最多捕獲8萬細(xì)胞芽淡。特點就是：細(xì)胞通量高，建庫成本低豆赏，捕獲周期短
主要用于細(xì)胞分型和標(biāo)記因子的鑒定挣菲，從而實現(xiàn)對細(xì)胞群體的劃分與細(xì)胞群體間基因表達(dá)差異的檢測，此外該技術(shù)還可以預(yù)測細(xì)胞分化與研究發(fā)育軌
跡掷邦。

BD Rhapsody
基于微孔平臺白胀，通過剛性磁珠進(jìn)行單細(xì)胞捕獲(捕獲后磁珠可以保存3個月以上)；利用成像系統(tǒng)對單細(xì)胞分離進(jìn)行質(zhì)控抚岗。特點是：細(xì)胞樣本損傷谢蚋堋；質(zhì)控可視化宣蔚，監(jiān)控雙細(xì)胞比例向抢；細(xì)胞捕獲較穩(wěn)定；可以結(jié)合蛋白表達(dá)量分析轉(zhuǎn)錄組胚委；可以設(shè)計panel分析特定基因

質(zhì)控

目的就是減少技術(shù)誤差笋额，去掉有問題的細(xì)胞，一般反映細(xì)胞是否有問題主要看兩個方面：

library size：能比對到每個細(xì)胞的總reads數(shù)(一個ibrary指一個細(xì)胞)
cell coverage：每個細(xì)胞中表達(dá)基因的平均數(shù)

標(biāo)準(zhǔn)Workflow

來自：Bioconductor workflow for single-cell RNA sequencing

可以從下圖看到：

第一步就是基因和細(xì)胞層次的標(biāo)準(zhǔn)化(normalization)：消除與研究無關(guān)的生物和技術(shù)影響篷扩；

第二步是降維兄猩，基因數(shù)量決定維度，需要從J個基因降到K個基因鉴未，實現(xiàn)了兩個事情枢冤，一是維度降低可以更好掌控數(shù)據(jù)，二是保留感興趣的特征同時減少了背景噪音铜秆；

第三步是聚類分群淹真，按照降維后的矩陣（K x N，N是細(xì)胞數(shù)量）连茧，這一步給了每個細(xì)胞一個cluster number核蘸；

第四步是差異分析，找到biomarker啸驯，也就是兩組細(xì)胞的表達(dá)差異基因

workflow

慢慢翻開細(xì)枝末節(jié)

加載客扎、創(chuàng)建、獲取single-cell數(shù)據(jù)集

單細(xì)胞分析利用的幾個包都是高度整合的罚斗，里面需要用到許多S4對象徙鱼，只有理解好這些對象才能看懂分析流程

首先了解一下SingleCellExperiment對象

介紹：

https://www.bioconductor.org/packages/devel/bioc/vignettes/SingleCellExperiment/inst/doc/intro.html

SingleCellExperiment(SCE)是一個S4對象，作者是Davide Risso and Aaron Lun 。官網(wǎng)稱它是light-weight container for single-cell genomics data 袱吆，看出它是一個容器厌衙，可以裝單細(xì)胞組學(xué)數(shù)據(jù)。那么就知道和我們?nèi)粘＝佑|的一般表達(dá)矩陣不同绞绒，它肯定還包含一些表型信息(對象Object 簡單理解就是：包含多個數(shù)據(jù)框婶希、矩陣或者列表，用到哪個提取哪個蓬衡。數(shù)據(jù)框取子集我們用$饲趋，對象取子集我們要用@)

之前知道了單細(xì)胞數(shù)據(jù)一般有三個矩陣(表達(dá)矩陣raw counts絮重、協(xié)變量信息gene-level & cell-level metadata)，這個對象就可以將這三個同時存儲在一個容器中芜抒，另外還有一些其他重要信息纠修，比如：spike-in info，dimentionality reduction coordinates官紫，size factors for each cell

下載、加載包

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("SingleCellExperiment", version = "3.8")
library(SingleCellExperiment)

創(chuàng)建對象

三個要素：表達(dá)矩陣、行名讨衣、列名

> # 先設(shè)計矩陣(假設(shè)counts值服從泊松分布)
> counts <- matrix(rpois(8,lambda = 10),ncol = 2,nrow = 4)
> rownames(counts) <- c("Lamp5","Fam19a1","Cnr1","Rorb")
> colnames(counts) <- c("SRR2140021","SRR2140023")
> counts
        SRR2140021 SRR2140023
Lamp5            4          6
Fam19a1          9         13
Cnr1            14          5
Rorb            15          7

# 創(chuàng)建對象第一種方法
#第一個參數(shù)assays就是取表達(dá)矩陣的list
> sce <- SingleCellExperiment(assays = list(counts = counts),
+                             rowData = data.frame(gene = rownames(counts)),
+                             colData = data.frame(cell = colnames(counts)))
> sce
class: SingleCellExperiment 
dim: 4 2 
metadata(0):
assays(1): counts
rownames(4): Lamp5 Fam19a1 Cnr1 Rorb
rowData names(1): gene
colnames(2): SRR2140021 SRR2140023
colData names(1): cell
reducedDimNames(0):
spikeNames(0):

# 創(chuàng)建對象第二種方法(先構(gòu)建bulk RNA的一個對象SummarizedExperiment，然后轉(zhuǎn)換格式)
> se <- SummarizedExperiment(assays = list(counts = counts))
# as =》 Force an Object to Belong to a Class 
> sce <- as(se,"SingleCellExperiment")

上面都是自己創(chuàng)造的模擬數(shù)據(jù)集式镐，下面看一個真的數(shù)據(jù)集

來自：Adult mouse cortical cell taxonomy revealed by single cell transcriptomics(2016反镇，Nature Neuroscience)

library(scRNAseq);data(allen)
# 其中包含了379個成年雄性小鼠不同神經(jīng)元的視覺皮層細(xì)胞數(shù)據(jù)
# allen數(shù)據(jù)集是一個SummarizedExperiment對象，需要先轉(zhuǎn)換
> sce <- as(allen,"SingleCellExperiment")
> sce
class: SingleCellExperiment 
dim: 20908 379 
metadata(2): SuppInfo which_qc
assays(4): tophat_counts
  cufflinks_fpkm rsem_counts
  rsem_tpm
rownames(20908): 0610007P14Rik
  0610009B22Rik ... Zzef1 Zzz3
rowData names(0):
colnames(379): SRR2140028
  SRR2140022 ... SRR2139341
  SRR2139336
colData names(22): NREADS
  NALIGNED ... Animal.ID
  passes_qc_checks_s
reducedDimNames(0):
spikeNames(0):
# size factors
> sizeFactors(sce) <- colSums(assay(sce))

質(zhì)控

目的：get out of technical issues and biases

利用的數(shù)據(jù)集來自：Batch effects and the effective design of single-cell gene expression studies. (2017, Tung et al. )

6 RNA-sequencing datasets per individual : 3 bulk & 3 single-cell (on C1 plates) and each dataset was sequenced in a different batch【Fluidigm C1 platform的低通量的微流控技術(shù)娘汞，能在捕獲細(xì)胞時可視化的控制empty wells or doublets】

Three C1 96 well-integrated fluidic circuit (IFC) replicates were collected from each of the three Yoruba individuals.

表達(dá)矩陣下載：ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE77nnn/GSE77288/suppl/GSE77288_reads-raw-single-per-sample.txt.gz

library("data.table")
library("dplyr")
reads_raw <- fread("GSE77288_reads-raw-single-per-sample.txt")
setDF(reads_raw)
dim(reads_raw)
# 得到metadata
anno <- reads_raw %>%
    select(individual:well) %>%
    mutate(batch = paste(individual, replicate, sep = "."),
           sample_id = paste(batch, well, sep = "."))
head(anno)
dim(anno)
# 得到表達(dá)矩陣
reads <- reads_raw %>%
    select(starts_with("ENSG"), starts_with("ERCC")) %>%
    t
colnames(reads) <- anno$sample_id
reads[1:5, 1:5]
dim(reads)
# 構(gòu)建對象
sce <- SingleCellExperiment(assays = list(counts = reads),
                            rowData = data.frame(gene = rownames(reads)),
                            colData = anno)

# 去掉樣本中表達(dá)量均為0的基因
sce2 <- sce[apply(counts(sce), 1, sum) >0,]
dim(sce);dim(sce2)#大約1700個基因不表達(dá)
# 添加Spike信息
isSpike(sce2, "ERCC") <- grepl("^ERCC-", rownames(sce2))

> sce2
class: SingleCellExperiment 
dim: 18726 864 
metadata(0):
assays(1): counts
rownames(18726): ENSG00000237683 ENSG00000187634 ... ERCC-00170
  ERCC-00171
rowData names(9): gene is_feature_control ... total_counts
  log10_total_counts
colnames(864): NA19098.r1.A01 NA19098.r1.A02 ... NA19239.r3.H11
  NA19239.r3.H12
colData names(41): individual replicate ...
  pct_counts_in_top_200_features_ERCC
  pct_counts_in_top_500_features_ERCC
reducedDimNames(0):
spikeNames(1): ERCC

# Calculate QC metrics
library(scater)
# 常用的feature control：such as ERCC spike-in sets or mitochondrial genes
sce2 <- calculateQCMetrics(sce2, feature_controls = list(ERCC = isSpike(sce2, "ERCC")))
# 結(jié)果會在colData這一部分增加許多列歹茶，幫助過濾低表達(dá)基因和樣本
names(colData(sce2))

進(jìn)行簡單的細(xì)胞過濾：可以根據(jù)文庫大小(Library size)和批次(Batch)進(jìn)行

# 1.Filter cells with Library size
# 設(shè)置閾值(文庫大小基本在2M reads以上，可以設(shè)置文庫大小的5%作為過濾)
# 閾值的設(shè)定沒有固定標(biāo)準(zhǔn)你弦，需要考慮具體數(shù)據(jù)集
threshold <- 100000
# 畫密度圖
plot(density(sce2$total_counts), main = 'Density - total_counts')
abline(v = threshold)
# 要保留的細(xì)胞
keep <- sce2$total_counts > threshold
# 統(tǒng)計過濾信息
table(keep)

FALSE  TRUE 
    6   858 

# 2.Filter cells with Batch[結(jié)果見下圖]
# 方法一：
cDataFrame <- as.data.frame(colData(sce2))
# plot cell data
ggplot(cDataFrame, aes(x = total_counts, y = total_counts_ERCC, col = batch)) + 
    geom_point()
# 方法二：
plotColData(
    sce2,
    y = "total_counts_ERCC",
    x = "total_counts",
    colour_by = "batch")
# 要保留的細(xì)胞
> keep <- sce$batch != "NA19098.r2"
> table(keep)
keep
FALSE  TRUE 
   96   768

下圖中惊豺，x軸是total counts，y軸是僅有ERCC基因的total counts禽作，每個點代表一個細(xì)胞尸昧，細(xì)胞根據(jù)批次上色。

可以看到橙色點(r2)的ERCC占比更高旷偿，這個批次在原文中也體現(xiàn)出來了烹俗，和其他batch非常不同

ERCC是已知序列和數(shù)量的合成mRNA，RNA的讀數(shù)將提供樣本間差異的信息萍程。spike-in control 在確定測序過程中的empty Wells和的dead cells有重要作用幢妄，高的ERCC含量與低質(zhì)量數(shù)據(jù)相關(guān)，并且通常是排除的標(biāo)準(zhǔn)茫负。

Filter cells with Batch

細(xì)胞過濾后磁浇，然后進(jìn)行基因過濾 (因為有的基因雖然表達(dá)，但是卻是在被過濾掉的細(xì)胞中朽褪，因此也算作無效)

目的就是保留至少在1個細(xì)胞中表達(dá)量不為0

filter_genes <- apply(counts(sce2), 1, function(x){
    length(x[x >= 1]) >= 1
})
table(filter_genes)
# 雖然這里不存在低表達(dá)基因

下次開始標(biāo)準(zhǔn)化

歡迎關(guān)注我們的公眾號～_～　　
我們是兩個農(nóng)轉(zhuǎn)生信的小碩置吓，打造生信星球无虚，想讓它成為一個不拽術(shù)語、通俗易懂的生信知識平臺衍锚。需要幫助或提出意見請后臺留言或發(fā)送郵件到jieandze1314@gmail.com

Welcome to our bioinfoplanet!