劉小澤寫于19.3.26
陌生的領(lǐng)域就像一張白紙呕臂,怎么畫就看自己了
前言
關(guān)于單細(xì)胞測序破托,我們已經(jīng)知道了它和bulk RNA-seq的區(qū)別,簡單說就是scRNA體現(xiàn)異質(zhì)性(個體)歧蒋,bulk體現(xiàn)平均程度(總體)土砂,因此利用bulk RNA不能區(qū)分一個樣本中的不同細(xì)胞類型
做scRNA的最大的好處就是可以"對癥下藥"(當(dāng)然這也是最理想的情況=》什么細(xì)胞配合什么種類或者劑量的藥物)
Biomarker:A biological molecule found in blood, other body fluids, or tissues that is a sign of a normal or abnormal process, or of a condition or disease. A biomarker may be used to see how well the body responds to a treatment for a disease or condition. Also called molecular marker and signature molecule.
一般在下機(jī)后會得到一個表達(dá)矩陣,行為基因谜洽,列為細(xì)胞樣本(根據(jù)測序平臺不同細(xì)胞數(shù)不同萝映,目前商業(yè)化平臺有10X和BD,一般都能達(dá)到2000-8000細(xì)胞量阐虚,可以覆蓋絕大部分組織的single cell群體類型)序臂。
還包括兩個協(xié)變量(covariates:解釋變量,不為實驗者所操縱实束,但仍影響實驗結(jié)果):gene-level(比如GC含量)和cell-level(比如細(xì)胞測序的批次信息)奥秆。
單細(xì)胞和bulk轉(zhuǎn)錄組表達(dá)矩陣主要有兩點不同:
bulk的表達(dá)矩陣是reads比對到基因的數(shù)目;單細(xì)胞的表達(dá)矩陣是reads比對到某個細(xì)胞的某個基因的數(shù)目(barcode用來區(qū)分細(xì)胞咸灿;UMI用來區(qū)分轉(zhuǎn)錄本/基因)
單細(xì)胞存在許多的0构订,有兩個可能:第一種是真的基因在細(xì)胞中不表達(dá)(參與細(xì)胞分化的基因并不是在每個cell cycle都表達(dá));第二種是技術(shù)誤差析显,基因?qū)嶋H表達(dá)但測序沒有測到(也就是"dropouts")
從頭開始慢慢理解
這一部分暫時不討論細(xì)節(jié)鲫咽,先看Big Picture
探索表達(dá)矩陣
The matrix of counts is the number of sequenced reads aligned to each gene and each sample
假設(shè)這個矩陣有10個基因(Lamp5
, Fam19a1
, Cnr1
, Rorb
, Sparcl1
, Crym
, Ddah1
, Lmo3
, Serpine2
, Pde1a
) 和5個細(xì)胞樣本(SRR2140028
, SRR2140022
, SRR2140055
, SRR2140083
, SRR2139991
)
表達(dá)矩陣如下:
> counts
SRR2140028 SRR2140022 SRR2140055 SRR2140083 SRR2139991
Lamp5 10 11 0 0 8
Fam19a1 11 9 0 6 0
Cnr1 0 0 0 12 0
Rorb 0 8 0 0 0
Sparcl1 0 7 0 13 0
Crym 8 9 10 12 5
Ddah1 16 0 0 8 0
Lmo3 0 0 8 13 0
Serpine2 8 0 0 0 0
Pde1a 0 14 8 11 0
# head of count matrix
counts[1:3, 1:3]
# count of specific gene and cell
reads <- counts["Cnr1", "SRR2140055"]
# overall proportion of zero counts
pZero <- mean(counts==0)
# cell library size
libsize <- colSums(counts)
The cell coverage is the average number of expressed genes that are greater than zero in each cell.
目的:ggplot2畫出cell coverage
假設(shè)目前有一個關(guān)于細(xì)胞樣本的協(xié)變量(cell-level),內(nèi)容如下:
> cell_info
names batch patient
SRR2140028 SRR2140028 1 a
SRR2140022 SRR2140022 1 a
SRR2140055 SRR2140055 2 b
SRR2140083 SRR2140083 2 c
SRR2139991 SRR2139991 2 c
先計算coverage谷异,然后加在cell_info的最后
# find cell coverage 計算count大于0的平均值
# 先將表達(dá)量為0的值賦值一個NA分尸,然后計算時忽略掉它們
is.na(counts)=counts==0
coverage <- colMeans(counts,na.rm=T)
cell_info$coverage <- coverage
library(ggplot2)
ggplot(cell_info, aes(x = names, y = coverage)) +
geom_point() +
ggtitle('Cell Coverage') +
xlab('Cell Name') +
ylab('Coverage')
看看大體的流程
測序=》single cell 發(fā)展歷史
圖片來自:Exponential scaling of single-cell RNA-seq in the past decade. (2018 Nature Protocols)
2009-2017不到十年,已報道的細(xì)胞研究數(shù)目就從1個飛躍至一百萬個
盡管這些方法各不相同歹嘹,但原理的主要不同之處有兩個:分子定量(Quantification)和捕獲方法(Capture)
- RNA分子定量(Quantification):主要是兩種技術(shù)(full-length和tag-based)
The full-length:基于全長箩绍,試圖獲取每個轉(zhuǎn)錄本覆蓋度一致且獨特的全長片段;
The tag-based:基于標(biāo)簽探針尺上,捕獲RNA的5'或3'端序列 - 捕獲方法(Capture):不僅決定了細(xì)胞的篩選方式以及實驗的通量材蛛,同時也間接反映了測序外的其他信息圆到。目前主要有三種捕獲方法:
微孔 microwell
微流 microfluidic
微滴 droplet
詳細(xì)看這里:https://hemberg-lab.github.io/scRNA.seq.course/introduction-to-single-cell-rna-seq.html
目前兩大商業(yè)化單細(xì)胞測序公司:
10X Genomics Chromium
利用微流控、油滴包裹和barcode標(biāo)記實現(xiàn)高通量細(xì)胞捕獲和標(biāo)記卑吭,一次最多捕獲8萬細(xì)胞芽淡。特點就是:細(xì)胞通量高,建庫成本低豆赏,捕獲周期短
主要用于細(xì)胞分型和標(biāo)記因子的鑒定挣菲,從而實現(xiàn)對細(xì)胞群體的劃分與細(xì)胞群體間基因表達(dá)差異的檢測,此外該技術(shù)還可以預(yù)測細(xì)胞分化與研究發(fā)育軌
跡掷邦。BD Rhapsody
基于微孔平臺白胀,通過剛性磁珠進(jìn)行單細(xì)胞捕獲(捕獲后磁珠可以保存3個月以上);利用成像系統(tǒng)對單細(xì)胞分離進(jìn)行質(zhì)控抚岗。特點是:細(xì)胞樣本損傷谢蚋堋;質(zhì)控可視化宣蔚,監(jiān)控雙細(xì)胞比例向抢;細(xì)胞捕獲較穩(wěn)定;可以結(jié)合蛋白表達(dá)量分析轉(zhuǎn)錄組胚委;可以設(shè)計panel分析特定基因
質(zhì)控
目的就是減少技術(shù)誤差笋额,去掉有問題的細(xì)胞,一般反映細(xì)胞是否有問題主要看兩個方面:
- library size:能比對到每個細(xì)胞的總reads數(shù)(一個ibrary指一個細(xì)胞)
- cell coverage:每個細(xì)胞中表達(dá)基因的平均數(shù)
標(biāo)準(zhǔn)Workflow
來自:Bioconductor workflow for single-cell RNA sequencing
可以從下圖看到:
第一步就是基因和細(xì)胞層次的標(biāo)準(zhǔn)化(normalization):消除與研究無關(guān)的生物和技術(shù)影響篷扩;
第二步是降維兄猩,基因數(shù)量決定維度,需要從J個基因降到K個基因鉴未,實現(xiàn)了兩個事情枢冤,一是維度降低可以更好掌控數(shù)據(jù),二是保留感興趣的特征同時減少了背景噪音铜秆;
第三步是聚類分群淹真,按照降維后的矩陣(K x N,N是細(xì)胞數(shù)量)连茧,這一步給了每個細(xì)胞一個cluster number核蘸;
第四步是差異分析,找到biomarker啸驯,也就是兩組細(xì)胞的表達(dá)差異基因
慢慢翻開細(xì)枝末節(jié)
加載客扎、創(chuàng)建、獲取single-cell數(shù)據(jù)集
單細(xì)胞分析利用的幾個包都是高度整合的罚斗,里面需要用到許多S4對象徙鱼,只有理解好這些對象才能看懂分析流程
首先了解一下SingleCellExperiment對象
介紹:
https://www.bioconductor.org/packages/devel/bioc/vignettes/SingleCellExperiment/inst/doc/intro.html
SingleCellExperiment(SCE)是一個S4對象,作者是Davide Risso and Aaron Lun 。官網(wǎng)稱它是light-weight container for single-cell genomics data
袱吆,看出它是一個容器厌衙,可以裝單細(xì)胞組學(xué)數(shù)據(jù)。那么就知道和我們?nèi)粘=佑|的一般表達(dá)矩陣不同绞绒,它肯定還包含一些表型信息(對象Object 簡單理解就是:包含多個數(shù)據(jù)框婶希、矩陣或者列表,用到哪個提取哪個蓬衡。數(shù)據(jù)框取子集我們用$
饲趋,對象取子集我們要用@
)
之前知道了單細(xì)胞數(shù)據(jù)一般有三個矩陣(表達(dá)矩陣raw counts絮重、協(xié)變量信息gene-level & cell-level metadata),這個對象就可以將這三個同時存儲在一個容器中芜抒,另外還有一些其他重要信息纠修,比如:spike-in info,dimentionality reduction coordinates官紫,size factors for each cell
下載、加載包
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("SingleCellExperiment", version = "3.8")
library(SingleCellExperiment)
創(chuàng)建對象
三個要素:表達(dá)矩陣、行名讨衣、列名
> # 先設(shè)計矩陣(假設(shè)counts值服從泊松分布)
> counts <- matrix(rpois(8,lambda = 10),ncol = 2,nrow = 4)
> rownames(counts) <- c("Lamp5","Fam19a1","Cnr1","Rorb")
> colnames(counts) <- c("SRR2140021","SRR2140023")
> counts
SRR2140021 SRR2140023
Lamp5 4 6
Fam19a1 9 13
Cnr1 14 5
Rorb 15 7
# 創(chuàng)建對象第一種方法
#第一個參數(shù)assays就是取表達(dá)矩陣的list
> sce <- SingleCellExperiment(assays = list(counts = counts),
+ rowData = data.frame(gene = rownames(counts)),
+ colData = data.frame(cell = colnames(counts)))
> sce
class: SingleCellExperiment
dim: 4 2
metadata(0):
assays(1): counts
rownames(4): Lamp5 Fam19a1 Cnr1 Rorb
rowData names(1): gene
colnames(2): SRR2140021 SRR2140023
colData names(1): cell
reducedDimNames(0):
spikeNames(0):
# 創(chuàng)建對象第二種方法(先構(gòu)建bulk RNA的一個對象SummarizedExperiment,然后轉(zhuǎn)換格式)
> se <- SummarizedExperiment(assays = list(counts = counts))
# as =》 Force an Object to Belong to a Class
> sce <- as(se,"SingleCellExperiment")
上面都是自己創(chuàng)造的模擬數(shù)據(jù)集式镐,下面看一個真的數(shù)據(jù)集
來自:Adult mouse cortical cell taxonomy revealed by single cell transcriptomics(2016反镇,Nature Neuroscience)
library(scRNAseq);data(allen)
# 其中包含了379個成年雄性小鼠不同神經(jīng)元的視覺皮層細(xì)胞數(shù)據(jù)
# allen數(shù)據(jù)集是一個SummarizedExperiment對象,需要先轉(zhuǎn)換
> sce <- as(allen,"SingleCellExperiment")
> sce
class: SingleCellExperiment
dim: 20908 379
metadata(2): SuppInfo which_qc
assays(4): tophat_counts
cufflinks_fpkm rsem_counts
rsem_tpm
rownames(20908): 0610007P14Rik
0610009B22Rik ... Zzef1 Zzz3
rowData names(0):
colnames(379): SRR2140028
SRR2140022 ... SRR2139341
SRR2139336
colData names(22): NREADS
NALIGNED ... Animal.ID
passes_qc_checks_s
reducedDimNames(0):
spikeNames(0):
# size factors
> sizeFactors(sce) <- colSums(assay(sce))
質(zhì)控
目的:get out of technical issues and biases
利用的數(shù)據(jù)集來自:Batch effects and the effective design of single-cell gene expression studies. (2017, Tung et al. )
6 RNA-sequencing datasets per individual : 3 bulk & 3 single-cell (on C1 plates) and each dataset was sequenced in a different batch【Fluidigm C1 platform的低通量的微流控技術(shù)娘汞,能在捕獲細(xì)胞時可視化的控制empty wells or doublets】
Three C1 96 well-integrated fluidic circuit (IFC) replicates were collected from each of the three Yoruba individuals.
library("data.table")
library("dplyr")
reads_raw <- fread("GSE77288_reads-raw-single-per-sample.txt")
setDF(reads_raw)
dim(reads_raw)
# 得到metadata
anno <- reads_raw %>%
select(individual:well) %>%
mutate(batch = paste(individual, replicate, sep = "."),
sample_id = paste(batch, well, sep = "."))
head(anno)
dim(anno)
# 得到表達(dá)矩陣
reads <- reads_raw %>%
select(starts_with("ENSG"), starts_with("ERCC")) %>%
t
colnames(reads) <- anno$sample_id
reads[1:5, 1:5]
dim(reads)
# 構(gòu)建對象
sce <- SingleCellExperiment(assays = list(counts = reads),
rowData = data.frame(gene = rownames(reads)),
colData = anno)
# 去掉樣本中表達(dá)量均為0的基因
sce2 <- sce[apply(counts(sce), 1, sum) >0,]
dim(sce);dim(sce2)#大約1700個基因不表達(dá)
# 添加Spike信息
isSpike(sce2, "ERCC") <- grepl("^ERCC-", rownames(sce2))
> sce2
class: SingleCellExperiment
dim: 18726 864
metadata(0):
assays(1): counts
rownames(18726): ENSG00000237683 ENSG00000187634 ... ERCC-00170
ERCC-00171
rowData names(9): gene is_feature_control ... total_counts
log10_total_counts
colnames(864): NA19098.r1.A01 NA19098.r1.A02 ... NA19239.r3.H11
NA19239.r3.H12
colData names(41): individual replicate ...
pct_counts_in_top_200_features_ERCC
pct_counts_in_top_500_features_ERCC
reducedDimNames(0):
spikeNames(1): ERCC
# Calculate QC metrics
library(scater)
# 常用的feature control:such as ERCC spike-in sets or mitochondrial genes
sce2 <- calculateQCMetrics(sce2, feature_controls = list(ERCC = isSpike(sce2, "ERCC")))
# 結(jié)果會在colData這一部分增加許多列歹茶,幫助過濾低表達(dá)基因和樣本
names(colData(sce2))
進(jìn)行簡單的細(xì)胞過濾:可以根據(jù)文庫大小(Library size)和批次(Batch)進(jìn)行
# 1.Filter cells with Library size
# 設(shè)置閾值(文庫大小基本在2M reads以上,可以設(shè)置文庫大小的5%作為過濾)
# 閾值的設(shè)定沒有固定標(biāo)準(zhǔn)你弦,需要考慮具體數(shù)據(jù)集
threshold <- 100000
# 畫密度圖
plot(density(sce2$total_counts), main = 'Density - total_counts')
abline(v = threshold)
# 要保留的細(xì)胞
keep <- sce2$total_counts > threshold
# 統(tǒng)計過濾信息
table(keep)
FALSE TRUE
6 858
# 2.Filter cells with Batch[結(jié)果見下圖]
# 方法一:
cDataFrame <- as.data.frame(colData(sce2))
# plot cell data
ggplot(cDataFrame, aes(x = total_counts, y = total_counts_ERCC, col = batch)) +
geom_point()
# 方法二:
plotColData(
sce2,
y = "total_counts_ERCC",
x = "total_counts",
colour_by = "batch")
# 要保留的細(xì)胞
> keep <- sce$batch != "NA19098.r2"
> table(keep)
keep
FALSE TRUE
96 768
下圖中惊豺,x軸是total counts,y軸是僅有ERCC基因的total counts禽作,每個點代表一個細(xì)胞尸昧,細(xì)胞根據(jù)批次上色。
可以看到橙色點(r2)的ERCC占比更高旷偿,這個批次在原文中也體現(xiàn)出來了烹俗,和其他batch非常不同
ERCC是已知序列和數(shù)量的合成mRNA,RNA的讀數(shù)將提供樣本間差異的信息萍程。spike-in control 在確定測序過程中的empty Wells和的dead cells有重要作用幢妄,高的ERCC含量與低質(zhì)量數(shù)據(jù)相關(guān),并且通常是排除的標(biāo)準(zhǔn)茫负。
細(xì)胞過濾后磁浇,然后進(jìn)行基因過濾 (因為有的基因雖然表達(dá),但是卻是在被過濾掉的細(xì)胞中朽褪,因此也算作無效)
目的就是保留至少在1個細(xì)胞中表達(dá)量不為0
filter_genes <- apply(counts(sce2), 1, function(x){
length(x[x >= 1]) >= 1
})
table(filter_genes)
# 雖然這里不存在低表達(dá)基因
下次開始標(biāo)準(zhǔn)化
歡迎關(guān)注我們的公眾號~_~
我們是兩個農(nóng)轉(zhuǎn)生信的小碩置吓,打造生信星球无虚,想讓它成為一個不拽術(shù)語、通俗易懂的生信知識平臺衍锚。需要幫助或提出意見請后臺留言或發(fā)送郵件到jieandze1314@gmail.com