我們開始看看作者是怎么標準化數(shù)據(jù)的。考慮到各種混雜因素對單個細胞中定量表達水平的影響宗苍,混雜因素包括dropout events的頻率、單個細胞中的低表達量mRNA薄榛、不同類型細胞捕獲的高可變性或技術噪聲讳窟,因此我們需要對數(shù)據(jù)QC后的數(shù)據(jù)進行標準化。接下來我們的分析結(jié)果會與文章有所不同敞恋,但是我們重點要學的是對數(shù)據(jù)的處理過程丽啡,所以我們繼續(xù)學習下去。
總的來說標準化分為三個步驟進行硬猫,
(1)使用Census algorithm將TPM值轉(zhuǎn)換為relative counts(用Monocle包的函數(shù)Relative2abs)
(2)對relative counts進行去卷積策略(scran包)补箍;
(3)用RUVSeq去除scrannormalized Census counts 額外的變異噪音。
#我們對過濾好的1326個細胞啸蜜,總共13280個基因重新進行QC
#同樣地坑雅,把cacultecell改成addPerCellQC
sceset <- sceset_all[keep.genes, ]
sceset<- addPerCellQC(sceset,exprs_values = "exprs",
detection_limit = min_thresh_log_tpm,flatten=FALSE,
subsets= list("regular" = which(colData(sceset)$pool_H12 == 0), "pool" = which(colData(sceset)$pool_H12 == 1)))
#創(chuàng)建對象
min_thresh_tpm <- 1
pd_2 <- new("AnnotatedDataFrame", data = as.data.frame(colData(sceset)))
featureNames(pd_2) <- rownames(pd_2)
fd_2 <- new("AnnotatedDataFrame", data = as.data.frame(rowData(sceset)))
featureNames(fd_2) <- rowData(sceset)$gene_short_name
HSMM <- newCellDataSet(assays(sceset)$tpm, phenoData = pd_2, featureData = fd_2,
expressionFamily = tobit(), lowerDetectionLimit = min_thresh_tpm)
HSMM <- detectGenes(HSMM, min_expr = min_thresh_tpm)
colData(sceset)$num_genes_expressed <- pData(HSMM)$num_genes_expressed
接下來這一步,使用relative2abs函數(shù)中的Census algorithm把TPM values 轉(zhuǎn)換為 relative counts衬横。也就是每個細胞中的TPM值除以每個細胞中估計的Total mRNA裹粤,重新計算TPM值。這樣子蜂林,TPM值被轉(zhuǎn)換成負二項分布的relative Census counts遥诉,這樣相比于edgeR,DESeq2和SCDE 能更好地反映了單個細胞中mRNA的實際數(shù)量悉尾。我們看看monocle包中對relative2abs描述
對于我們來說突那,學到了另一種轉(zhuǎn)錄組測序數(shù)據(jù)處理方法。比如是否可以拿GEO/TCGA或者自己的數(shù)據(jù)TPM/FPKM轉(zhuǎn)換成relative counts构眯,然后進行下游分析愕难?這個留給大家自己探索了....
# Transform the TPM values into relative counts
rpc_matrix <- relative2abs(HSMM)
#創(chuàng)建對象
HSMM <- newCellDataSet(as(as.matrix(rpc_matrix), "sparseMatrix"),
phenoData = phenoData(HSMM),
featureData = featureData(HSMM),
lowerDetectionLimit = min_thresh_log_tpm,
expressionFamily = negbinomial.size())
#歸一化處理
HSMM <- estimateSizeFactors(HSMM)
HSMM <- estimateDispersions(HSMM)
assays(sceset)$monocle <- as.matrix(exprs(HSMM))
assays(sceset)$log2_tpm <- log2(assays(sceset)$tpm + 1)
assays(sceset)$log2_monocle <- as.matrix(log2(exprs(HSMM) + 1))
assays(sceset)$log2_tpm_median <- t(scale(t(assays(sceset)$exprs), scale = FALSE))
接下來這一步會過將1326個samples過濾掉137個,剩下1189個samples惫霸。但是我們按照作者的代碼跑的話猫缭,無法過濾掉這137個樣本。我的理解是可能是在前面的calculateQCMetrics換成了addPerCellQC壹店,QC的算法略有改變猜丹。另外一個是對于結(jié)果會有一點影響。
# 將每個細胞的表達量除以其因子大小硅卢,減少了干擾效應
exprs_filtered <- t(t(exprs(HSMM)/pData(HSMM)$Size_Factor))
#挑選基因值不是0的射窒,此時沒有1個samples的值為0藏杖。
nz_genes <- which(exprs_filtered != 0)
#再次log2轉(zhuǎn)換
exprs_filtered[nz_genes] <- log2(exprs_filtered[nz_genes] + 1)
assays(sceset)$norm_log2_monocle <- as.matrix(exprs_filtered)
#此時,我們得到的samples仍然是1376個
length(colnames(as.matrix(exprs_filtered)))
> length(colnames(as.matrix(exprs_filtered)))
[1] 1326
接下來使用scan包來remove additional sources of unwanted variation脉顿。
RUVgb函數(shù)認為在樣本中管家基因具有近似恒定表達量蝌麸,并使用廣義線性模型去回歸管家基因表達的估計變異。
## scran normalization on Monocle counts
sceset_mon_norm <- sceset
assays(sceset_mon_norm)$counts <- assays(sceset)$monocle
clusters_mon_norm <- quickCluster(sceset_mon_norm, min.size = 100)
sceset_mon_norm <- computeSumFactors(sceset_mon_norm, cluster = clusters_mon_norm, sizes = 200, positive = TRUE)
if (length(which(sizeFactors(sceset_mon_norm) == 0)) > 0) {
sceset_mon_norm <- sceset_mon_norm[,-which(sizeFactors(sceset_mon_norm) == 0)]
assays(sceset_mon_norm)$log2_tpm_median <- t(scale(t(assays(sceset_mon_norm)$exprs), scale = FALSE))
}
sceset_mon_norm <- logNormCounts(sceset_mon_norm)
assays(sceset_mon_norm)$exprs <- assays(sceset_mon_norm)$logcounts
很遺憾艾疟,我們沒能揪出那 137個samples来吩,將1326個samples過濾成1189個samples,但是作者的代碼大概意思懂了就好蔽莱,畢竟單細胞發(fā)展這么快弟疆,相應的R包更新也快,有些函數(shù)里面的算法可能已經(jīng)改變了盗冷,這樣的話有時候就會沒法完全重現(xiàn)作者的數(shù)據(jù)怠苔。作者也在給出了這樣的說明:
# expressionFamily = gaussianff()) not supported from VGAM 1.0.6 onwards
# original code (working slightly differently now with the more recent versions of quickCluster and computeSumFactors from scran)
# due to updates/changes in quickCluster and computeSumFactors (at a minimum), the resulting values are slightly different
# than the ones obtained when carrying out the original analysis. to reproduce the results from the paper, we can simply
# read in the normalized data and the phenoData, as preprocessed in the original analysis
可喜的是,我們學習到了作者是如何對數(shù)據(jù)進行質(zhì)控的仪糖,這一部分的內(nèi)容在方法那里有詳細的描述嘀略,有興趣的小伙伴們可以去鉆研。并且我們可以拿到作者上傳到GSE上的已經(jīng)normalized data來繼續(xù)下游分析學習乓诽,下一節(jié)將學習細胞亞群的注釋和一張復雜的熱圖繪制,我會結(jié)合文獻的內(nèi)容跟大家分享咒程。